1 Overview


Under  this  DARPA  contract  the  University  of  Rochester  developed  and  disseminated 
papers,  ideas,  algorithms,  analysis,  software,  applications,  and  implementations  for 
parallel  programming  environments  for  computer  vision  and  for  vision  applications. 
The work has been widely reported and highly influential. Faculty members involved
have received several prestigious honors, including an IBM Faculty Development Award
for Michael Scott and an ONR Young Investigator Award for Tom LeBlanc. We were
awarded a DARPA Parallel Systems postgraduate fellowship and have won several Best
Paper awards. From
1984  to  1988  the  department  produced  approximately  400  papers,  more  than  half 
of  which  are  in  refereed  conferences  and  journals.  There  have  been  14  completed 
Ph.D.  theses  directly  related  to  parallel  programming  environments  and  vision 
applications,  and  approximately  ten  more  such  theses  are  in  progress. 

The  most  significant  work  centered  on  the  Butterfly  Parallel  Processor,  the  MaxVideo 
pipelined  parallel  image  processor,  and  the  development  of  the  real-time  computer 
vision  laboratory.  For  the  Butterfly,  the  Psyche  multi-model  operating  system  was 
developed  (as  well  as  two  other  experimental  operating  systems),  and  the  CONSUL 
autoparallelizing  compiler  was  designed  (and  the  Lynx  language  compiler  ported). 
Much basic and influential performance monitoring and debugging work was completed,
resulting in working systems and novel algorithms. There was also significant
research in systems and applications using the other parallel architecture in the
laboratory, the MaxVideo parallel pipelined image processor. As a part of the DARPA
contract, we developed a heterogeneous parallel architecture involving pipelined and
MIMD parallelism, and integrated it with a high-performance 9-degree-of-freedom
robot head. The hardware of the laboratory is described in Section 3.

Early in the contract period, Rochester demonstrated SIMD-like programs on the
BBN Butterfly Parallel Processor that show linear parallel speedup. Many
applications for the image processing pipeline (including tracking, color
histogramming, feature detection, frame-rate depth maps, frame-rate
time-to-collision maps, large-scale correlations, segmentation using motion blur,
and others) have been written. The efficacy of intimate cooperation between vision
computations and controlled motion has been demonstrated. This work has attracted
national attention and won international prizes. The Zebra object-oriented system
for Datacube programming was developed, and the Zed menu editor was built on top of
Zebra. These programming environments are useful for any register-level devices,
and are a considerable improvement on previous Datacube environments. They are
being made available to all by anonymous ftp.

Programming MIMD applications is difficult, and Rochester is a leader in developing
operating systems (PSYCHE), performance monitoring (PPUTTS) and debugging
(INSTANT REPLAY) tools to make the job easier. The PLATINUM system automatically
solves many of the problems (code and data replication and caching) in getting
SIMD-like programs to run efficiently on Non-Uniform Memory Access architectures
(such as hypercubes, Butterfly, Encore, etc.). The MIMD program development tools
(PPUTTS, Instant Replay, and Moviola) provide several graphical views and a LISP
interface to a multi-process, multi-processor application. The system provides
repeatable single-stepping, statistics, symbolic debugging, and other
“traditional” debugging techniques that have not previously been available to
parallel programmers. This work has produced many influential papers, several
prizes, and the operational systems.

At the end of the contract period the PSYCHE operating system was operational,
and it is currently supporting multi-agent applications and multi-model (e.g. both
threads and heavyweight processes) programming environments. PSYCHE has been
used to support five independent processes controlling the bouncing of a tethered
balloon with a paddle; this hybrid system uses pipelined parallelism from the
MaxVideo system for low-level visual input. As a result of the DARPA contract, we
are now developing plans (the ARMTRAK system) for integrating pipelined
parallelism, MIMD parallelism with multiple computational models, and sequential
planning paradigms to manage a dynamic model railroad system.

Rochester has implemented object recognition algorithms in neural nets, and
developed hardware realizations for the resulting constraint-propagation networks.
The domain includes large sets of objects, and uses Bayesian techniques to handle
partial and incomplete information. The Rochester Connectionist Simulator and the
Zebra/Zed systems are available by anonymous ftp. Together they have been
distributed to several hundred sites worldwide.

This  final  report  starts  with  a  quick  guide  to  key  papers  that  have  been  produced 
over  the  years,  and  then  in  turn  briefly  outlines  the  Laboratory,  the  work  in  operating 
systems,  languages,  utilities,  performance  monitoring,  pipelined  parallelism,  parallel 
computer  vision  applications,  integration  of  a  cognitive  layer  into  the  system,  and 
technology  transfer  issues.  Finally,  a  list  of  theses  produced  under  the  contract  is 
included.  More  detail  is  available  from  the  papers  in  the  literature,  and  extensive 
references  are  provided. 


2  Key  Reports  by  Topic 

This  section  briefly  points  out  key  reports.  More  detail  on  most  of  these  projects 
appears in later sections of this final report.¹

¹ References by number (e.g. [1]) are found as numbered references in the Bibliography.
References by name and date (e.g. [Ballard 1990]) are found by name in one of the
publications lists.

2.1 Laboratory for Parallel Vision Research

During  the  contract  period,  Rochester  developed  and  commissioned  a  binocular  robot 
head, acquired and commissioned a multiple degree-of-freedom platform for the 3-dof
robot head, and acquired a real-time, pipelined parallel image processing engine.
The laboratory allows us to test our systems concepts in a complex, visuo-motor
real-time environment. Software integration is important as well: PSYCHE’s first
application  will  be  to  manage  the  higher-level  data  structures  (e.g.  the  world  model) 
in  an  integrated  parallel  vision  system  that  also  uses  the  pipelined  parallelism  of  the 
frame-rate  MaxVideo  image  processing  system.  The  key  reports  are  [Brown  et  al. 
1988  (Rochester  Robot);  Ballard  1990  (Animate  Vision);  Ballard  et  al.  1987  (Eye 
Movements);  Brown  and  Rimey  1988  (Coordinate  systems,  kinematics...);  Brown 
1988  (Parallel  Vision  with  the  Butterfly);  Brown  1989a  (Gaze  Control)]. 

2.2  Parallel  Hardware  and  Programming  Languages 

Throughout the contract period Rochester has kept pace with the technical
developments of the Butterfly product line of BBN-ACI. We have owned three generations
of  Butterfly  computers,  including  one  of  the  largest  ever  sold.  Much  of  our  research 
transcends  any  particular  piece  of  hardware,  though  its  implementation  of  course 
requires  intimate  familiarity  with  particular  hardware. 

Languages  for  MIMD  parallel  computers  have  been  developed  and  ported  under 
the  contract,  and  quantitative  comparisons  made  between  programming  models.  A 
library  for  programming  the  MaxVideo  pipeline  parallel  image  analysis  hardware  has 
also  been  developed.  The  key  reports  are  [Baldwin  1989  (Consul);  Baldwin  and  Quiroz 
1987  (Parallel  programming);  LeBlanc  et  al.  1988  (Large-scale  parallel  programming); 
Scott  et  al.  1990  (Multi-model  parallel  programming);  Crowl  1989  (A  uniform  object 
model);  Tilley  1989  (Zebra  for  MaxVideo)]. 

2.3 Parallel Programming Environment - Operating Systems

Three operating systems (Elmwood, Platinum, Psyche) have been developed for the
Butterfly. The most ambitious project is Psyche, though Platinum automatically
solves a number of problems that users face when using Uniform System-style
programming on a MIMD computer (automatic caching and data migration, for instance).
The key papers are [Scott et al. 1989b,c (Psyche description); LeBlanc et al. 1989b
(Elmwood description); Cox and Fowler 1989 (Platinum description)].

2.4 Parallel Programming Environment - Utilities and Libraries

Along with languages and operating systems, Rochester produced systems utilities for
communication, file systems, and compilers. They span a broad range from parallel
file systems through new languages for expressing parallel computation. Applications
packages such as the current version of the neural net simulator and the
image-processing utilities allow speedups of up to a factor of 100 over
single-workstation implementations. User interfaces to large multiprocessor
computers are a difficult issue addressed by Yap’s work, and many of the packages
extend the range of computational models available to a user. For instance, the Ant
Farm project provides a capability we noticed we needed after the first DARPA
Parallel Architectures Benchmark and Workshop, namely the ability to support many
lightweight processes. The key papers are [Scott and Jones 1988 (Ant Farm); Dibble
and Scott 1989a,b (Bridge file system); Bolosky et al. 1989 (memory management
techniques); Goddard et al. 1989 (Connectionist simulator); LeBlanc and Jain 1987
(Crowd control); Yap and Scott 1990 (PenGuin)].

2.5 Parallel Programming Environment - Performance Monitoring

Debugging  and  performance  monitoring  in  an  MIMD  environment  are  significantly 
more  difficult  than  on  a  uniprocessor.  Rochester  contributed  many  results  over  the 
course of the contract. The Instant Replay system allows normal cyclic debugging
in  a  nondeterministic  parallel  environment  by  keeping  a  log  of  interactions  between 
processes.  Moviola  is  a  suite  of  interactive  performance  monitoring  tools.  The  key 
papers  are  [LeBlanc  and  Mellor-Crummey  1987  (Instant  Replay);  Fowler  et  al.  1988, 
LeBlanc  et  al.  1990  (Moviola)]. 
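
The logging idea can be illustrated with a minimal Python sketch. It is not the Instant Replay implementation: the class ReplayableObject, the function run, and the decision to log worker identifiers are all assumptions made here for illustration. A first run records the order in which threads reach a shared object; a second run is forced to follow that log, so a nondeterministic interleaving becomes repeatable and ordinary cyclic debugging is possible.

```python
import threading


class ReplayableObject:
    """Shared object whose access order can be recorded, then enforced.

    Hypothetical illustration only; the names and the log format are assumptions.
    """

    def __init__(self, log=None):
        self._cond = threading.Condition()
        self._replaying = log is not None
        self.log = list(log) if log is not None else []
        self._next = 0        # position in the log during replay
        self.history = []     # (worker_id, value) pairs in access order

    def access(self, worker_id, value):
        with self._cond:
            if self._replaying:
                # Block until the recorded order says it is this worker's turn.
                self._cond.wait_for(lambda: self.log[self._next] == worker_id)
                self._next += 1
            else:
                # Record mode: remember who got here, in the order they arrived.
                self.log.append(worker_id)
            self.history.append((worker_id, value))
            self._cond.notify_all()


def run(workers, accesses_per_worker, log=None):
    """Run `workers` threads against one shared object; replay if `log` is given."""
    shared = ReplayableObject(log)

    def body(worker_id):
        for i in range(accesses_per_worker):
            shared.access(worker_id, i)

    threads = [threading.Thread(target=body, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    recorded = run(workers=3, accesses_per_worker=5)
    replayed = run(workers=3, accesses_per_worker=5, log=recorded.log)
    print(replayed.history == recorded.history)
```

The replayed run reproduces the recorded history because every access waits for its logged turn. Only the order of interactions is logged, not the data exchanged, which keeps the log small.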

2.6  Vision  Applications 

Vision applications are an important part of our work, but are only indirectly
supported by the contract, which views applications as potential users of the
parallel systems we are developing. For example, Paul Chou’s work used the Markov
Random Field formulation for intermediate-level vision and produced results that
have been quantified and are better than those of any other known technique. We have
ported his evidence-combination to the Butterfly, where it runs as a set of three
cooperating agents under Tom LeBlanc’s SMP system. As another example, the work of
Cooper and Swain is being ported to the Connection Machine at Syracuse University’s
DARPA-funded NPAC. Object recognition, inference, quantification of performance
in biologically oriented neural net computational techniques, and hardware for
relaxation computations have all been under active study.

Several parallel vision applications were pursued, including Butterfly programming,
Markov Random Field and connectionist research, and work aimed at integrating the
real-time laboratory and using it for complex planning tasks that include sensing
and acting. Key papers are [Feldman et al. 1988a,b; Feldman 1987 (Basic
connectionism); Simard et al. 1988 (Recurrent backpropagation); Porat and Feldman
1988 (Learning theory); Olson et al. 1987 (Vision on Butterfly); Ballard and
Ozcandarli 1988 (Kinetic depth calculations); Brown et al. 1989a (decentralized
Kalman filters); Aloimonos and Brown 1988 (Robust computation of intrinsic images);
Chou and Brown 1988 (Sensor fusion, reconstruction and labeling); Wixson and
Ballard 1990 (Color histograms); Rimey and Brown 1990 (Hidden Markov models);
Yamauchi 1989 (Juggler); Nelson 1990 (Flow fields); Cooper 1988 (Structure
recognition); Sher 1987a,b,c (Probabilistic low-level vision); Swain 1988 (Object
recognition from large database); Swain and Cooper 1988 (Parallel hardware for
recognition); Martin, Brown, and Allen 1990 (ARMTRAK project); Allen and Hayes 1985
(Theory of time); Allen 1989 (Representing time)].


3  The  Laboratory 

The  Rochester  Robotics  Laboratory  has  developed,  during  the  years  of  the  DARPA 
contract,  from  a  single  drum-scanner  to  the  configuration  described  in  this  section.  It 
currently  consists  of  four  key  components  (Fig.  1):  a  “head”  containing  cameras  for 
visual input, a robot arm that supports and moves the head, a special-purpose
parallel processor for high-bandwidth, low-level vision processing, and a general-purpose
parallel  processor  for  high-level  vision  and  planning.  This  unique  design  allows  for 
visuo-motor  exploration  over  an  800  cubic  foot  workspace,  while  also  providing  huge 
computing  and  power  resources.  Thus,  we  do  not  suffer  the  communication  and  power 
limitations  of  most  mobile  platforms. 

The robot head (shown in Fig. 2), built as a joint project with the University’s
Mechanical  Engineering  Department,  has  three  motors  and  two  CCD  high-resolution 
television  cameras  providing  input  to  a  MaxVideo  digitizer  and  pipelined  image- 
processing  system.  One  motor  controls  pitch  or  altitude  of  the  two-eye  platform, 
and  separate  motors  control  each  camera’s  yaw  or  azimuth,  providing  independent 
“vergence”  control.  The  motors  have  a  resolution  of  2,500  positions  per  revolution 
and  a  maximum  speed  of  400  degrees/second.  The  controllers  allow  sophisticated 
velocity  and  position  commands  and  data  read-back. 

The robot body is a PUMA761 six-degree-of-freedom arm with a two-meter-radius
workspace and a top speed of about one meter/second. It is controlled by a
dedicated LSI-11 computer implementing the proprietary VAL execution monitor and
programming interface.

The  MaxVideo  system  consists  of  several  independent  boards  that  can  be  cabled 
together  to  achieve  many  frame-rate  image  analysis  capabilities:  digitizing,  storage, 
and  transmission  of  images  and  sub-images,  8x8  or  larger  convolution,  pixel-wise 
image  processing,  cross-bar  image  pipeline  switching  for  dynamic  reconfiguration  of 
the  image  pipeline,  look-up  tables,  histogramming  and  feature  location.  A  digital 
signal  processing  computer  on  one  board  can  perform  arbitrary  computations,  and 
also  has  a  high  speed  image  bus  interface  and  a  VME  bus  master  interface  so  it  can 
program the other boards in the same manner as the host. The MaxVideo boards are
all  register  programmable  and  are  controlled  by  the  Butterfly  or  Sun  via  VME  bus. 

A unique feature of our laboratory, one crucial for our future research, is the
capability to use a multiprocessor as the central computing resource and host. Our
Butterfly Plus Parallel Processor contains 28 nodes, each consisting of an MC68020
processor, MC68851 MMU, MC68881 FPU, and 4 MBytes of memory. The Butterfly
is a shared-memory multiprocessor with non-uniform memory access times; each
processor may directly access any memory in the system, but a reference to remote
memory has approximately 15 times the latency of a local one. The Butterfly has a
VME bus connection that mounts in
the  same  card  cage  as  the  MaxVideo  and  motor  controller  boards.  Currently,  a  SUN 
workstation  acts  as  a  host  system  for  the  lab.  As  software  develops  on  the  Butterfly, 
we  plan  to  migrate  functionality  from  the  workstation  host  to  the  Butterfly. 
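
To see why data placement matters on such a machine, the figures above can be put into a rough cost model. The Python sketch below is our own back-of-the-envelope illustration, not a measurement: it assumes a simple linear mix of local and remote references, takes the factor of 15 and the 28 nodes from the text, and ignores switch contention. The function name and its parameters are hypothetical.

```python
def effective_access_cost(remote_fraction, remote_penalty=15.0, local_cost=1.0):
    """Average cost of a memory reference, in units of one local reference.

    remote_fraction: share of references that go to another node's memory.
    remote_penalty:  remote/local latency ratio (about 15 according to the text).
    Assumes a linear model with no contention; this is a simplification.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must lie between 0 and 1")
    local_share = 1.0 - remote_fraction
    return local_cost * (local_share + remote_fraction * remote_penalty)


if __name__ == "__main__":
    # Data scattered uniformly over 28 nodes: 27 of every 28 references are remote.
    print(effective_access_cost(27 / 28))
    # Data mostly local, with one reference in ten going remote.
    print(effective_access_cost(0.1))
```

Under this model, uniformly scattered data costs about 14.5 local references per access, while keeping 90 percent of references local brings the average down to about 2.4. This gap is the motivation for the replication and migration techniques discussed elsewhere in this report.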

The  DARPA  contract  supported  development  of  software  for  the  two  parallel 
computing  engines  in  this  laboratory,  the  Butterfly  and  the  MaxVideo,  and  of  relevant 
applications  and  principles. 


4  Languages  for  Parallel  Computation:  CONSUL, 
an  Auto-Parallelizing  Compiler 

Constraint languages view programs as systems of equations to be solved, rather than
sequences of commands to be executed. This section describes the current state of
research on CONSUL, an experimental constraint language for programming
multiprocessors. We cover the design of the language, formal results on a heuristic
for executing it, experimental evidence that this heuristic works well in practice
and potentially provides substantial parallelism, and initial work on the design of
multiprocessor implementations.

Figure 2: A multi-exposure photograph of the Rochester
robot  in  action.  The  arm  is  the  largest  industrial  arm 
on  the  market,  while  the  unique  head  was  designed  by 
Rochester's  Computer  Science  and  Mechanical  Engineering 
Departments. 


4.1  Introduction 


Designing parallel programs is generally considered to be much harder than designing
sequential ones. This feeling is based on the observation that making multiple
processors cooperate in solving a problem raises issues that simply do not exist in
sequential programming: partitioning of programs and data among processors, process
synchronization, communication between processes, etc. Programming languages that
shield programmers as much as possible from the problems of parallelism are an
obvious tool for making parallel processing easier to use. This section describes the
CONSUL project at the University of Rochester, a project that is studying the use of
constraint languages for parallel programming.

This research concentrates on implicitly parallel programming of fairly tightly
coupled MIMD multiprocessors. In “implicitly parallel” programming, compilers or
run time software, not programmers, are responsible for finding and exploiting
parallelism. “Fairly tightly coupled MIMD multiprocessor” denotes an MIMD machine
containing a few tens to a few hundreds of processors, with data at a remote
processor accessible at some small multiple of the cost of a local memory reference
(up to a few tens of times as expensive). Throughout this section, the term
“multiprocessor” refers to an architecture of this type. This research addresses
general purpose parallelism, i.e., parallelism that can be exploited in a variety of
applications. The choice of implicit parallelism and MIMD multiprocessors represents
the setting that is best suited to this generality. It should be possible, however,
to apply the ideas discussed here to other parallel architectures and to languages
in which programmers have more control over parallelism.

Constraint languages are generalizations of logic programming languages.
Parallelism in constraint languages and logic programming languages is a consequence
of the mathematical logic that underlies both. However, many important kinds of
computation (e.g., arithmetic or input and output) do not have efficiently executable
definitions in pure logic. Practical logic programming languages include a number
of extra-logical features to support such computations. For example, most logic
programming languages define the precise order in which goals in a clause and
clauses in a procedure will be tested, and arithmetic and input/output functions
often rely on this order to produce meaningful results; “predicates” with side
effects are also often present. Because they express very common computations, these
extra-logical features are heavily used in practical logic programs. Unfortunately,
the mathematical bases for parallelism that apply to pure logic (e.g., and- and
or-parallelism, freedom from side-effects, etc.) do not apply to the extra-logical
features, so real logic programs are hard to parallelize. Constraint languages attack
this problem by offering primitives with clean logical semantics for the offending
computations. Implementations can respect these logical semantics, and thus provide
the same parallelizations as are applicable to the rest of the language, without
incurring the inefficiencies of actually executing the full logical definitions.

The key to generalizing logic programming into constraint programming is to
think  of  a  program  as  a  system  of  constraints  on  the  allowable  values  of  variables. 
“Executing”  the  program  consists  of  finding  values  for  the  variables  such  that  all  of 
the  constraints  are  satisfied.  In  a  logic  programming  language,  the  constraints  are 
described  by  user  defined  predicates.  Constraint  languages  provide  a  richer  set  of 
built  in  constraints,  generally  based  on  some  higher  level  of  mathematical  notation 
than pure logic. For example, a language based on the algebra of real numbers might
treat constraints such as “X = Y + Z”, “X = Y × Z”, and so on as primitives. A
crucial  feature  of  these  primitives  is  that  their  semantics  is  consistent  with  that  of 
any  other  constraint  expressible  in  the  language.  For  example,  a  primitive  arithmetic 
constraint  can  be  solved  for  the  value  of  any  of  its  arguments,  given  values  for  the 
other  two.  This  contrasts  with  the  built  in  treatment  of  arithmetic  in  languages  such 
as  Prolog,  which  only  “solves”  arithmetic  relations  in  a  single  direction.  To  give 
a  precise  definition,  a  constraint  language  is  one  in  which  programs  are  systems  of 
relations,  and  program  execution  consists  of  solving  the  relations  for  the  values  of 
any  variables  appearing  in  them.  The  logic  programming  languages  are  thus  a  proper 
subset  of  the  constraint  languages.  Also  note  that  languages  such  as  CLP  [13],  which 
are  often  described  as  “logic  programming  languages”,  really  lie  in  a  larger  class  of 
constraint  language. 
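
The difference between a relation and a one-directional function can be made concrete with a small Python sketch. It is an illustration under our own assumptions, not part of any constraint language implementation: the function solve_plus is hypothetical, and it treats the single constraint X = Y + Z as a relation that can be solved for whichever argument is left unknown.

```python
def solve_plus(x=None, y=None, z=None):
    """Solve the constraint X = Y + Z for the one argument passed as None.

    Returns the completed triple (x, y, z). With all three values given,
    the call simply checks that the constraint holds.
    """
    unknowns = [name for name, value in (("x", x), ("y", y), ("z", z)) if value is None]
    if len(unknowns) > 1:
        raise ValueError("at most one argument may be unknown: %s" % unknowns)
    if x is None:
        return (y + z, y, z)      # forward direction, as a function would compute
    if y is None:
        return (x, x - z, z)      # solved "backwards" for Y
    if z is None:
        return (x, y, x - y)      # solved "backwards" for Z
    if x != y + z:
        raise ValueError("constraint X = Y + Z is violated")
    return (x, y, z)


if __name__ == "__main__":
    print(solve_plus(y=2, z=3))   # X unknown
    print(solve_plus(x=5, z=3))   # Y unknown
    print(solve_plus(x=5, y=2))   # Z unknown
```

All three calls complete the same triple. A built-in arithmetic evaluator that works in a single direction corresponds to the first branch only.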

One  of  the  first  constraint  languages  outside  of  the  logic  programming  subset 
was  described  by  Steele  [21].  This  language  demonstrated  the  potential  of  constraint 
languages, but was far from being a general purpose programming language. More
recently, Leler described a schema for generating full-fledged constraint languages
based on term rewriting [14]. Finally, there has been a flurry of interest in constraint
languages from the logic programming community [9, 10, 13]. This work has taken the
form of extending the existing state of the art in logic programming languages with
additional constraint satisfaction heuristics. No one, from Steele on, has published
in-depth descriptions of parallel constraint programming. Both Steele and Leler
mentioned  that  it  should  be  possible,  but  did  not  pursue  the  idea.  Although  a  number 
of  approaches  to  parallel  logic  programming  have  been  suggested  [6,  8,  19],  they  have 
been  separate  from  work  on  general  constraint  languages. 

4.2  CONSUL  Programming  Language 

The  centerpiece  of  the  CONSUL  project  is  the  programming  language  CONSUL  [2]. 
This  language  is  the  source  language  accepted  by  the  interpreters  (and  eventually 
compilers)  being  developed,  and  is  the  language  in  which  all  of  the  samples  used  in 
experiments  are  written.  CONSUL  is  also  a  vehicle  for  testing  ideas  about  the  design 
of  general  purpose  constraint  languages.  Given  these  roles,  CONSUL  is  very  much 
an  experimental  prototype  of  a  constraint  language.  It  supports  demonstrations  of 
realistic  programs  for  a  variety  of  applications,  but  in  a  laboratory  setting  rather  than 
industrial  production. 




 (1)  (defrel Knapsack (Ints Sum)
 (2)    (exists ((Sub-Ints (power-set Ints)))
 (3)      (Total Sum Sub-Ints)))
 (4)  (defrel Total (Sum Ints)
 (5)    (or (and (equal Ints empty)
 (6)             (equal Sum 0))
 (7)        (exists ((I Ints)
 (8)                 (New-Sum integer)
 (9)                 (New-Ints (power-set integer)))
(10)          (and (set-minus New-Ints Ints (set I))
(11)               (minus New-Sum Sum I)
(12)               (Total New-Sum New-Ints)))))


Figure  3:  CONSUL  Program  for  the  Knapsack  Problem 

The  central  ideas  behind  CONSUL  are  demonstrated  by  the  program  in  Figure  3. 
This  program  solves  the  0-1  Knapsack  problem,  namely,  given  a  set  of  integers  (“Ints” 
on  line  1),  find  a  subset  of  it  (“Sub-Ints”,  line  2)  whose  members  sum  to  a  given  value 
(“Sum”).  The  core  of  the  CONSUL  solution  appears  on  lines  1  through  3.  These  lines 
really  just  rephrase  the  problem  specification  using  CONSUL  syntax.  Line  1  defines 
a  user  defined  relation  called  “Knapsack”,  whose  arguments  are  the  parameters  of 
the  problem.  Lines  2  and  3  assert  that  for  the  problem  to  be  solvable  there  must  be 
a  subset  of  “Ints”  that  sums  to  “Sum”.  Lines  4  through  12  define  what  it  means  for 
a  set  of  integers  to  sum  to  a  value.  The  definition  is  recursive,  stating  that  set  “Ints” 
sums  to  “Sum”  either  if  “Ints”  is  empty  and  “Sum”  is  zero,  or  if  there  is  some  element 
of  “Ints”  that  can  be  removed  from  it  to  yield  a  new  set  of  integers  and  subtracted 
from  “Sum”  to  yield  a  new  sum,  such  that  the  new  set  sums  to  the  new  sum. 
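
For readers more familiar with conventional languages, the declarative reading of Figure 3 can be approximated in Python. This is a sequential sketch under our own assumptions, not a CONSUL implementation: the names power_set, total, and knapsack are ours, and where the sketch tries alternatives one after another, CONSUL leaves the order unspecified so that they may be explored in parallel.

```python
from itertools import combinations


def power_set(ints):
    """Yield every subset of `ints` as a frozenset (the "power-set" constructor)."""
    items = list(ints)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def total(target, ints):
    """Mirror of the Total relation: do the members of `ints` sum to `target`?

    Either the set is empty and the target is 0, or some element I exists such
    that the remaining set sums to target - I.
    """
    if not ints:
        return target == 0
    # "exists I in Ints": any element may be tried. The exhaustive search is
    # exponential and is meant only to follow the shape of the definition.
    return any(total(target - i, ints - {i}) for i in ints)


def knapsack(ints, target):
    """Mirror of the Knapsack relation: return a subset summing to `target`, or None."""
    for subset in power_set(ints):
        if total(target, subset):
            return subset
    return None


if __name__ == "__main__":
    answer = knapsack({3, 5, 8, 13}, 16)
    print(sorted(answer), sum(answer))
```

As in the CONSUL program, knapsack only restates the problem and total carries the recursive definition of summation. Which satisfying subset is returned depends on enumeration order, which the relational formulation deliberately leaves open.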

The  simplest  CONSUL  statements  are  assertions  that  some  relation  holds  between 
one  or  more  values.  These  assertions  have  an  s-expression  syntax,  with  the  general 
form “(relation value_1 ... value_n)”. The relation used in one of these forms can either
be  built  in  to  CONSUL  or  it  can  be  defined  by  the  user.  Examples  of  built-in  relations 
appear  on  lines  5,  6,  10,  and  11  of  the  Knapsack  program.  These  built-ins  assert, 
respectively,  that  two  values  are  equal  (lines  5  and  6),  that  the  first  is  the  set  difference 
of  the  second  and  third  (line  10),  and  that  the  first  value  is  the  integer  difference  of 
the  next  two  (line  11).  Uses  of  a  user  defined  relation  are  shown  on  lines  3  and 
12.  Lines  1  and  4  introduce  user  defined  relations,  declaring  their  names  and  formal 
parameters. 

Applications  of  individual  relations  can  be  combined  into  blocks  by  the  connectives 
“and”,  “or”,  and  “not”,  and  the  quantifiers  “forall”  and  “exists”.  These  connectives 
and quantifiers have the expected logical meanings. See lines 2, 5, 7, and 10 of the
Knapsack  program  for  typical  uses  of  connectives  and  quantifiers.  Quantifiers  have 
the  general  syntax 

(Quantifier ((Name_1 Set_1)
             ...
             (Name_n Set_n))
  Body)


Each Name_i names a variable that is quantified over the corresponding Set_i.
Body is an arbitrary CONSUL form, which must be satisfied for the quantifier as
a whole to be satisfied. Whether the body must be satisfied for all values of the
quantified variables or only some depends on whether the quantifier is “forall” or “exists”.
The  scope  of  the  names  introduced  by  a  quantifier  is  limited  to  the  body  of  the 
quantifier.  CONSUL  does  not  require  that  the  forms  joined  by  an  “and”  or  “or” 
be  satisfied  in  any  particular  order,  nor  does  it  require  that  the  possible  values  of 
quantified  variables  be  tested  in  any  particular  order.  Thus  “and”  and  “or”  introduce 
and-  and  or-parallelism  into  programs,  and  quantifiers  introduce  data  parallelism. 
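
The order independence of quantifiers can be sketched in Python. The helpers below are hypothetical illustrations rather than CONSUL primitives, and they assume a side-effect-free body. Because the result does not depend on the order in which elements of the set are visited, the same quantifier can be evaluated over the elements concurrently, which is the data parallelism referred to above.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations


def forall(domain, body):
    """True when `body` holds for every element of `domain`."""
    return all(body(value) for value in domain)


def exists(domain, body):
    """True when `body` holds for at least one element of `domain`."""
    return any(body(value) for value in domain)


def exists_parallel(domain, body, workers=4):
    """Same meaning as `exists`, with the body applied to elements concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return any(pool.map(body, domain))


def order_independent(quantifier, domain, body):
    """Check that every visiting order of `domain` gives the same answer."""
    answers = {quantifier(order, body) for order in permutations(domain)}
    return len(answers) == 1


if __name__ == "__main__":
    def is_even(n):
        return n % 2 == 0

    print(forall({2, 4, 6}, is_even))
    print(exists({1, 3, 4}, is_even))
    print(exists_parallel({1, 3, 4}, is_even))
    print(order_independent(exists, {1, 2, 3, 4}, is_even))
```

The check in order_independent succeeds only because the body has no side effects. A body that printed or mutated shared state would reintroduce a dependence on evaluation order, which is exactly what the declarative semantics rules out.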

CONSUL’s formal basis is in set theory. Set theory seems to be a good pragmatic
base  for  a  parallel  programming  language,  since  it  provides  a  basic  data  structure  that 
is  inherently  unordered.  Consequently  there  is  no  semantic  reason  why  operations  on 
this  data  structure  should  be  done  in  any  particular  order.  Not  surprisingly,  sets  and 
relations  between  them  play  an  important  role  in  CONSUL.  The  Knapsack  program 
demonstrates  that  variables  can  be  bound  to  sets  as  well  as  to  scalars  (see  “Ints” 
in  both  the  “Knapsack”  and  “Total”  relations).  The  example  also  demonstrates  one 
of  the  built  in  relations  between  sets  (“set-minus”,  line  10).  CONSUL  provides  the 
empty set, the integers, and the printing ASCII character set as built in set
constants. These sets are denoted by the names “empty”, “integer”, and “character”
respectively. See lines 5, 8, and 9 for examples of their use. Programmers can build
more sophisticated sets using so-called “set constructors”. Lines 2 and 9 of the
Knapsack program show uses of the “power-set” set constructor and line 10 shows a use of
the  “set”  set  constructor.  “Power-set”  represents  the  power  set  of  its  argument;  “set” 
represents  the  set  containing  its  arguments.  Note  that  set  constructors  in  CONSUL 
are  not  relations.  They  are  descriptions  of  values,  much  as  constants  and  variable 
names  are. 

One  important  kind  of  set  not  demonstrated  by  the  Knapsack  program  is  the 
sequence.  Sequences  are  the  main  tool  in  CONSUL  for  describing  cases  in  which 
order  matters  (for  example,  to  say  that  outputs  must  appear  in  the  same  order  as  the 
inputs  from  which  they  are  computed).  Formally,  a  sequence  is  a  set  whose  elements 
are  special  pairs  containing  an  integer  index  (the  position  of  the  pair  in  the  sequence) 
and an arbitrary datum (the actual element value). Since sequences are sets, all of
the usual parallel set operations can be applied to them. By referring to specific
positions  in  the  sequence  by  index,  however,  programmers  can  also  impose  an  order 
on  its  elements. 
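The set-of-pairs view of sequences can be illustrated with a minimal sketch. It is written in Python rather than CONSUL, and the names `make_sequence`, `element_at`, and `doubled`, as well as the choice to start indices at 1, are assumptions made only for illustration.

```python
def make_sequence(items):
    """Model a sequence as an unordered set of (index, datum) pairs."""
    return frozenset(enumerate(items, start=1))


def element_at(seq, index):
    """Recover the datum stored at a given position, or None if absent."""
    matches = [datum for (i, datum) in seq if i == index]
    return matches[0] if matches else None


def doubled(seq):
    """A set operation on a sequence: pairs may be processed in any order."""
    return frozenset((i, 2 * datum) for (i, datum) in seq)


seq = make_sequence([7, 3, 9])
second_of_doubled = element_at(doubled(seq), 2)
```

Because each pair carries its own index, `doubled` is free to visit the pairs in any order (or all at once), yet position 2 of the result still corresponds to position 2 of the input.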

To  summarize,  CONSUL  has  a  number  of  features  that  seem  to  suit  it  to  parallel 
programming. Among these features are:

•  The  semantics  of  the  control  structures  (logical  connectives  and  quantifiers)  and 
the  main  data  structure  (sets)  are  independent  of  execution  order. 

•  Being purely declarative, primitives and user-defined relations cannot have side
effects that make their meaning depend on execution order.

•  Sequences  provide  a  way  for  programmers  to  describe  sequential  ordering  when 
it  is  important  without  unnecessarily  reducing  parallelism. 

Furthermore,  CONSUL  supports  a  range  of  problem  description  styles,  from  highly 
declarative  to  highly  algorithmic.  For  example,  the  “Knapsack”  relation  in  Figure  3 
is a very declarative translation of the 0-1 Knapsack problem specification into CONSUL.
It says very little about an exact algorithm for solving the problem. On the
other hand, “Total” is a specific iterative algorithm for summing a sequence of numbers.
Such stylistic flexibility is important to declarative programming, parallel or
not, because it lets programmers control program efficiency through proper choice of
algorithms without leaving the declarative framework. Note that none of these features
is necessarily limited to CONSUL; other constraint languages do not include
the  exact  combination  of  features  that  CONSUL  does,  but  there  is  no  reason  why 
they  couldn’t.  Although  these  features  lend  credence  to  the  claim  that  constraint 
languages  are  good  tools  for  general  purpose  parallel  programming,  they  do  not  by 
themselves  prove  it.  The  current  status  of  the  search  for  firmer  proof  is  discussed  in 
the  following  section. 

4.3  Status 

The  main  ideas  underlying  CONSUL  were  conceived  in  late  1985  and  early  1986. 
Since  then,  work  has  progressed  on  a  number  of  fronts.  The  main  areas  of  research 
have  been: 

•  Design  of  the  CONSUL  language, 

•  Formal study of the constraint satisfaction problem and a heuristic for solving
it,

•  Uniprocessor implementation of this heuristic and experimental characterization
of its performance, including estimates of parallelism, and

•  Initial work on a multiprocessor implementation of CONSUL.

The  main  results  and  future  plans  in  each  of  these  areas  are  summarized  in  the 
following  sections. 

4.3.1  The  CONSUL  Language 

The  overall  design  of  CONSUL  was  described  in  Section  4.2.  This  design  has  been 
fairly  stable  since  late  1986.  The  only  changes  have  been  minor  ones,  including 
occasional  addition  of  new  primitives,  fixing  ambiguities  in  the  language’s  semantics, 
and  improving  its  approach  to  input  and  output. 

A  number  of  programs  have  been  written  in  CONSUL.  These  programs  range  in 
length  from  a  few  lines  to  4  or  5  pages.  Some  examples  are  listed  in  Table  1.  These 
programs  are  taken  from  a  variety  of  sources  and  application  areas.  The  application 
areas  sampled  include  numeric  computing,  databases,  text  handling,  combinatorial 
search  problems,  etc.  Most  of  the  original  problem  specifications  come  from  within 
the  CONSUL  project.  A  few,  however,  are  taken  from  outside  sources.  These  include 
Puzzle,  which  is  taken  from  a  classic  Prolog  demonstration,  and  Hotel,  Median,  and 
both versions of Rationals, which are taken from Pascal assignments used in an introductory
computer science course. Some of the problems developed by CONSUL
project members were intended to demonstrate CONSUL's strengths (Lexer, Knapsack,
and XC), but others were deliberately designed to test possible weaknesses of
the  language  (notably  Database). 

The  variety  of  programs  that  have  been  written  in  CONSUL  demonstrates  that 
the  goal  of  producing  a  general  purpose  laboratory  language  has  been  met.  Several 
people,  with  varying  amounts  of  programming  experience,  wrote  these  programs.  The 
experience  does  not  seem  to  have  been  too  unpleasant  for  any  of  them,  although  all 
found CONSUL's syntax awkward at times. The problem is that each of CONSUL's
built-in relations represents a fairly small computation, so that any non-trivial expression
must be constructed from a number of built-ins. Many temporary variables
are needed to tie these built-ins together into a single expression. This problem is
not very surprising: CONSUL was deliberately given a minimal syntax in order
to  allow  quick  prototyping  in  Lisp.  It  is,  however,  time  for  a  more  usable  version  of 
the  language.  A  front  end  processor  is  therefore  being  written  to  parse  expressions 
written  in  an  algebraic  notation  into  CONSUL.  This  front  end  will  make  CONSUL 
as  it  now  exists  the  intermediate  code  for  future  interpreters  and  compilers. 

Section  4.2  mentioned  how,  in  principle,  CONSUL  programmers  can  control  the 
efficiency  of  their  programs  by  proper  choice  of  algorithms.  Several  of  the  programs 
from  Table  1  demonstrate  this  point  concretely.  For  example,  the  two  versions  of 
Rationals  differ  in  that  Version  1  uses  a  naive  declarative  description  of  what  it  means 
for  a  rational  number  to  be  in  reduced  form,  whereas  Version  2  is  based  on  Euclid’s 
algorithm. Version 2 runs more than 4 times faster than Version 1 (see Section 4.3.3
for details). Similarly, the key to making the concurrent database work in CONSUL
turned out to be using a multiversion concurrency control algorithm rather than one
based on locking parts of a single-version database.

Name                  Description
Vector Abs            Absolute values of elements of a vector
Vector Sum            Sum of two vectors
Matrix Multiply       Product of two matrices
Rationals 1           Rational number abstract data type
Rationals 2           Improved version of Rationals 1
Lexer                 Simple lexical analyzer
Database              Skeletal concurrent database
Hotel                 Simple hotel reservation database
Median                Median finder (using merge sort)
Puzzle                Solves “SEND+MORE=MONEY” puzzle
Knapsack              0-1 knapsack (Figure 3)
Simulated Annealing   Solves knapsack by simulated annealing
XC                    Exact cover
Formatter             Text formatter (in progress)

Table 1: Applications Programmed in CONSUL

4.3.2  Formal  Results  on  Constraint  Satisfaction 

In  order  to  execute  a  constraint  program,  one  needs  a  way  of  solving  systems  of 
constraints.  Unfortunately,  solving  systems  of  constraints  written  in  a  language  even 
remotely  suited  to  general  purpose  programming  is  an  undecidable  problem.  This 
fact is a simple consequence of Matijasevic's work [15] on solvability of Diophantine
equations; see [16] for details. Restricting the language's expressive power doesn't
really improve the situation: constraint satisfaction in even trivial languages is
NP-complete or harder.

Faced  with  the  intractability  of  constraint  satisfaction,  implementors  of  constraint 
languages have explored a variety of heuristics [1, 4, 7, 9, 14, 21, 22]. Local propagation
[21] seems to be the easiest to parallelize. This is because local propagation,
unlike the other heuristics, solves a system of constraints by solving its members
relatively independently of each other. The information that must be communicated
between members is well defined. Thus local propagation should have fewer shared resource
bottlenecks than other heuristics. Unfortunately, local propagation is thought
to be a relatively weak heuristic. In order to determine exactly what it can and can't
do, a precise definition of local propagation was developed, from which its strengths
and weaknesses can be proved [16]. The main results are as follows.

To show satisfiability/unsatisfiability of “X = Y + Z”:
    if X is unbound, Y and Z are bound
        Satisfiable, assign Y + Z to X.  (Method 1)
    else if Y is unbound, X and Z are bound
        Satisfiable, assign X - Z to Y.  (Method 2)
    else if Z is unbound, X and Y are bound
        Satisfiable, assign X - Y to Z.  (Method 3)
    else if X, Y, and Z are all bound
        if X = Y + Z  (Method 4)
            Satisfiable, no new bindings needed.
        else
            Unsatisfiable.
    else
        Not enough information yet for proof.

Figure 4: Outline of Satisfier for “Sum” Constraints

Informally,  local  propagation  proves  or  disproves  the  satisfiability  of  a  system 
of  constraints  by  proving  each  of  the  primitive  constraints  contained  in  the  system. 
For  each  type  of  primitive  there  is  a  procedure  to  prove  (or  disprove)  satisfiability 
of  an  isolated  use  of  that  primitive.  At  the  time  such  a  procedure  is  invoked  on  a 
particular  use,  some  of  the  arguments  to  the  constraint  are  already  bound,  either  as 
a  result  of  earlier  satisfiability  proofs  or  because  they  were  constants  to  begin  with. 
Other  arguments  are  unbound.  The  satisfaction  procedure  simply  tests  the  pattern 
of  bound  and  unbound  arguments,  and  dispatches  to  a  sub-procedure  (henceforth 
called  a  method)  that  either  computes  satisfying  values  for  one  or  more  unbound 
arguments,  or  tests  bound  arguments  to  see  if  the  constraint  is  satisfied.  The  exact 
patterns  tested  and  methods  invoked  reflect  the  algebraic  properties  of  the  constraint 
being  proved.  For  example,  Figure  4  shows  pseudo-code  for  satisfying  constraints 
of  the  form  “X  =  Y  +  Z”.  The  results  of  these  local  proofs  are  communicated 
between  satisfiers  in  order  to  arrive  at  a  consistent  set  of  variable-to-value  bindings 
for  the  system  as  a  whole.  The  exact  pattern  of  communication  between  satisfiers  is 
determined  by  the  connectives  or  quantifiers  joining  the  primitive  constraints. 
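The dispatch on patterns of bound and unbound arguments can be sketched in Python. This is only an illustration of the outline in Figure 4, not the interpreter's code: the function name `satisfy_sum`, the representation of bindings as a dictionary, and the three status strings are all assumptions, and the three variable names are assumed to be distinct.

```python
def satisfy_sum(env, x, y, z):
    """Try to prove the constraint x = y + z under the bindings in env.

    env maps variable names to numbers; a name missing from env is unbound.
    Returns (status, new_env), where status is "satisfiable",
    "unsatisfiable", or "unknown" (not enough information yet).
    """
    bound = {v: v in env for v in (x, y, z)}
    new_env = dict(env)
    if not bound[x] and bound[y] and bound[z]:      # Method 1
        new_env[x] = env[y] + env[z]
        return "satisfiable", new_env
    if not bound[y] and bound[x] and bound[z]:      # Method 2
        new_env[y] = env[x] - env[z]
        return "satisfiable", new_env
    if not bound[z] and bound[x] and bound[y]:      # Method 3
        new_env[z] = env[x] - env[y]
        return "satisfiable", new_env
    if bound[x] and bound[y] and bound[z]:          # Method 4
        holds = env[x] == env[y] + env[z]
        return ("satisfiable" if holds else "unsatisfiable"), new_env
    return "unknown", new_env


status, bindings = satisfy_sum({"Y": 2, "Z": 3}, "X", "Y", "Z")
```

Each branch corresponds to one method in the figure; the final `"unknown"` result is the case in which the satisfier must wait for bindings communicated from other satisfiers.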

Our formalization of local propagation begins with the notion of environment. An
environment for a system of constraints is a mapping from variables appearing in the
system to sets of values. Local propagation tries to find an environment for a system
of constraints such that the system represents a true set of assertions whenever
each variable is replaced by an arbitrary value from the corresponding set. Finding
such an environment proves the system satisfiable. We model the satisfaction procedures
from the informal description as sets of the form {(p_1, f_1), ..., (p_n, f_n)}. The
p_i are predicates on environments; the f_i are functions that transform environments.
The predicates represent the tests from the informal description; the transformation
functions represent the associated methods.

local_propagate(C, E)
    while C ≠ ∅
        E' ← E
        C' ← ∅
        for all c ∈ C such that c has a p_i true of E
            E' ← E' merge f_i(E)
            C' ← C' ∪ {c}
        C ← C - C'
    return E

Figure 5: The Local Propagation Heuristic.

Our  version  of  local  propagation  is  shown  in  Figure  5.  Local  propagation  loops 
over  a  set  of  constraints,  C,  updating  an  environment,  E.  Each  iteration  of  the 
loop  modifies  the  environment  to  satisfy  one  or  more  additional  constraints.  To 
do  this,  each  iteration  selects  from  the  as  yet  unproven  constraints  those  for  which 
some  test  predicate  is  true  of  the  current  environment.  For  each  of  these  constraints, 
the  corresponding  transformation  function  is  applied  to  the  current  environment, 
producing  a  new  environment.  The  mappings  in  this  environment  are  intersected 
with  those  of  the  untransformed  environment  to  produce  another  new  environment 
that  satisfies  both  the  selected  constraint  and  all  previously  proven  constraints.  This 
intersection is represented in the figure by the merge operation. Figure 5 uses
temporary constraint set C' and environment E' to emphasize the separation between
determining  which  constraints  can  be  proved  in  the  current  iteration  and  updating 
the  environment  and  set  of  unsolved  constraints  —  provable  constraints  are  selected 
based  on  the  unmodified  constraint  set  and  environment.  Local  propagation  stops 
when  all  constraints  have  been  proved.  The  possibility  that  no  unproven  constraint 
has  a  test  predicate  that  is  true  of  the  current  environment  is  addressed  in  a  theorem 
presented  below. 
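The loop just described can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the project's implementation: environments are dictionaries from variable names to finite Python sets, a variable counts as bound when its set is a singleton, and the helper `sum_constraint` is a hypothetical encoding of “X = Y + Z” as (predicate, transform) pairs. The sketch copies the merged environment back into the current one at the end of each iteration, as the prose describes, and raising an error when no predicate holds is an added assumption to keep the sketch from looping forever.

```python
def merge(env_a, env_b):
    """Intersect, variable by variable, the value sets of two environments."""
    return {var: env_a[var] & env_b[var] for var in env_a}


def local_propagate(constraints, env):
    """Each constraint is a list of (predicate, transform) pairs on environments."""
    remaining = list(constraints)
    while remaining:
        new_env = dict(env)              # E' <- E
        proved = set()                   # C' <- empty set (indices into remaining)
        for index, constraint in enumerate(remaining):
            for predicate, transform in constraint:
                if predicate(env):       # tested against the unmodified E
                    new_env = merge(new_env, transform(env))
                    proved.add(index)
                    break
        if not proved:                   # assumption: report a stall explicitly
            raise RuntimeError("no applicable method: local propagation stalls")
        env = new_env                    # update the environment
        remaining = [c for i, c in enumerate(remaining) if i not in proved]
    return env


def sum_constraint(x, y, z):
    """Hypothetical (predicate, transform) pairs for the constraint x = y + z."""
    def bound(env, *names):
        return all(len(env[v]) == 1 for v in names)

    def only(env, v):
        return next(iter(env[v]))

    def restrict(env, var, value):
        out = dict(env)
        out[var] = {value}
        return out

    return [
        (lambda e: bound(e, y, z), lambda e: restrict(e, x, only(e, y) + only(e, z))),
        (lambda e: bound(e, x, z), lambda e: restrict(e, y, only(e, x) - only(e, z))),
        (lambda e: bound(e, x, y), lambda e: restrict(e, z, only(e, x) - only(e, y))),
    ]


start = {"A": set(range(20)), "B": {2}, "C": {3}, "D": set(range(20))}
solution = local_propagate(
    [sum_constraint("A", "B", "C"), sum_constraint("D", "A", "B")], start
)
```

In this example the first iteration can prove only A = B + C, because D = A + B has no predicate that is true of the initial environment; the second iteration then proves D = A + B. If the starting environment already maps A to a value other than 5, the intersection performed by `merge` leaves A mapped to the empty set.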

The ability of local propagation to satisfy constraints is characterized by the following
theorems. For proofs of these theorems, see [16].

Theorem: If local_propagate terminates, then the environment that it returns is
the unique maximal restriction of E that satisfies C.

A restriction of E that satisfies C is an environment that satisfies C and is derived
from E by removing elements from the sets to which variables are mapped by E.
A  maximal  restriction  is  one  produced  by  removing  the  fewest  elements.  Complete 
definitions  of  these  terms  are  given  in  [16].  Uniqueness  of  maximal  restrictions  follows 
from  some  assumptions  about  uniqueness  of  solutions  to  primitive  constraints.  These 
assumptions  simplify  the  proof  of  the  theorem,  but  can  be  relaxed  to  allow  constraint 
systems  to  have  multiple  solutions. 

Informally, the above theorem says that if local propagation is applied to a solvable
system of constraints, then it will either find all solutions to that system or will
fail to terminate. The maximal restrictions that “satisfy” an unsolvable system of
constraints are those that map one or more variables to the empty set. Thus
local_propagate also proves unsolvability of unsolvable systems (or doesn't terminate).
The next theorem establishes a necessary condition for termination of local propagation.
This condition is easily tested by local_propagate as it runs.

Theorem: If local_propagate terminates, then at the beginning of each iteration of
the while loop at least one of the constraints remaining to be solved has a p_i that
evaluates to true.

The above theorem is important for its uses in demonstrating when local propagation
does not work. For example, local propagation fails when confronted by a
simultaneous system of equations. By the algebraic properties that make the equations
“simultaneous” in the first place, no constraint can be solved in isolation in the
initial environment. Thus no p_i can be true of the initial environment. The inability
of  local  propagation  to  solve  such  systems  then  follows  immediately  from  the  theorem. 
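A small worked instance of this argument, in Python, uses the pair of equations X + Y = 10 and X - Y = 2. The encoding is an assumption made for illustration: each equation is written as a “sum” constraint (a, b, c) meaning a = b + c, the constants 10 and 2 are named T and U, and a method is taken to be applicable only when at most one variable of the constraint is unbound (the patterns of Figure 4).

```python
def applicable(constraint, bound):
    """True if some method of the sum constraint a = b + c can fire,
    i.e. at most one of its variables is still unbound."""
    unbound = [v for v in constraint if v not in bound]
    return len(unbound) <= 1


# T = X + Y and X = Y + U, with T = 10 and U = 2 given as constants.
constraints = [("T", "X", "Y"), ("X", "Y", "U")]
initially_bound = {"T", "U"}

# No constraint has an applicable method in the initial environment,
# so by the theorem local propagation cannot terminate on this system.
stalls = not any(applicable(c, initially_bound) for c in constraints)

# Supplying a value for X from outside breaks the deadlock.
stalls_with_x = not any(applicable(c, initially_bound | {"X"}) for c in constraints)
```

Each equation has two unbound variables, so neither can be solved in isolation, even though the system as a whole has the unique solution X = 6, Y = 4.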

4.3.3  Experimental  Results 

The  theoretical  results  in  the  previous  section  describe  when  local  propagation  is 
and  is  not  able  to  prove  satisfiability  or  unsatisfiability  of  a  system  of  constraints. 
They  do  not,  however,  say  anything  about  how  often  the  conditions  necessary  for 
local  propagation  to  succeed  occur  in  real-world  programs.  A  series  of  experiments 
with  real  CONSUL  programs  run  on  a  local  propagation  based  interpreter  is  thus 
in  progress.  These  experiments  are  designed  to  answer  several  questions  about  the 
practical  implementation  of  CONSUL,  including: 

•  What  fraction  of  real  constraints  are  provable  by  local  propagation,  and 

•  How much parallelism does local propagation provide when applied to real
programs?

A  summary  of  the  experiments  is  given  here;  details  of  the  experiments’  structure 
and  some  preliminary  results  can  be  found  in  [3]. 

The  parallelism  experiments  are  similar  to  those  used  by  Nicolau  and  Fisher  [17] 
to study imperative languages. Experiments in this style consist of running a program
sequentially, producing a detailed log of the sequential execution, and then analyzing
potential parallelism in the log. Because parallelism is analyzed after the fact,
when complete information on data and control dependencies is available, optimal
parallel schedules are easy to produce. Because the detected parallelism is optimal,
it represents an upper bound on the parallelism that can be exploited by a practical
implementation. There are, however, several important lessons to be learned
from upper bounds. First, the experiments can be controlled to isolate the exact
causes of the observed effects. In these experiments, the results reflect only the behavior
of CONSUL and local propagation, independent of specific machine or operating system
overheads. Second, if even upper bounds on parallelism are low then one knows
that  something  is  seriously  wrong  with  one’s  approach.  Since  there  was  little  prior 
experience  with  CONSUL,  with  constraint  programming  in  general,  or  with  local 
propagation,  this  was  a  very  real  possibility.  Finally,  upper  bounds  on  parallelism 
can  be  used  in  quantitative  tests  of  how  well  truly  parallel  implementations  perform, 
and  may  help  developers  find  bottlenecks  in  those  implementations. 

Data  for  these  experiments  are  taken  from  a  CONSUL  interpreter.  This  interpreter 
uses  local  propagation  to  run  CONSUL  programs,  requiring  the  user  to  solve  any 
constraints that local propagation cannot handle (see footnote 2). In addition to producing the logs
needed  for  parallelism  analysis,  the  interpreter  also  keeps  track  of  the  total  number 
of  constraints  proved  satisfiable  or  unsatisfiable  and  the  number  of  these  proofs  that 
were  done  by  local  propagation.  These  statistics  are  the  basis  for  a  study  of  how  often 
real-world  constraints  are  provable  by  local  propagation.  The  interpreter’s  log  shows 
the  order  in  which  constraints  were  proved,  the  variables  whose  values  were  needed 
as  inputs  to  each  proof,  and  the  variables  whose  values  became  defined  as  a  result 
of  each  proof.  This  information  makes  data  dependencies  between  proofs  explicit, 
allowing  analysis  of  potential  parallelism.  This  analysis  is  done  by  a  program  called 
the  compactor.  The  compactor  generates  a  parallel  schedule  according  to  which 
the  constraints  in  a  log  could  have  been  proved.  Each  constraint  is  scheduled  for 
the  earliest  time  at  which  all  of  its  inputs  are  available.  The  compactor  assumes 
that  an  unbounded  number  of  processors  are  available,  that  there  is  no  overhead  for 
communication  between  processors,  and  that  all  constraints  take  unit  time  to  prove. 
These assumptions are gross simplifications of the real world, but are reasonable given
the  goal  of  finding  upper  bounds  on  parallelism  in  CONSUL  programs,  independent 
of  particular  machines  and  operating  systems.  The  interpreter  and  compactor  are 
both  written  in  Common  Lisp,  and  run  on  Texas  Instruments  Explorer  workstations. 
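The scheduling rule used by the compactor can be sketched in a few lines of Python (the actual compactor is written in Common Lisp and is not reproduced here). The log format, a list of (constraint name, input variables, output variables) tuples in sequential proof order, and the function name `compact` are assumptions made for illustration. Variables that no logged proof defines, such as constants and program inputs, are assumed to be available from the start.

```python
def compact(log):
    """Schedule each logged proof at the earliest step at which all of its
    inputs are available, assuming unbounded processors, free communication,
    and unit-time proofs.

    Returns (schedule, length): schedule maps a step number to the names of
    the constraints proved in that step; length is the number of steps.
    """
    ready_at = {}      # variable -> step in which its value becomes defined
    schedule = {}
    for name, inputs, outputs in log:
        step = 1 + max((ready_at.get(v, 0) for v in inputs), default=0)
        schedule.setdefault(step, []).append(name)
        for v in outputs:
            ready_at[v] = step
    length = max(schedule, default=0)
    return schedule, length


log = [
    ("c1", ["a"], ["x"]),
    ("c2", ["b"], ["y"]),
    ("c3", ["x", "y"], ["z"]),
    ("c4", ["z"], []),
]
schedule, length = compact(log)
```

In this hypothetical log, c1 and c2 depend only on given values and share the first step, c3 must wait for both, and c4 must wait for c3, so four sequential proofs compact into a schedule of three steps.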

Footnote 2: Users won't, of course, be expected to solve constraints in production CONSUL
implementations, only in the experimental prototype.

Table 2 shows some of the results from the experiments. The “Input Size” column
gives the size of the input to the CONSUL program, measured in units appropriate
to the way that program exploits parallelism. For Vector Sum and Vector Abs, input
size is the length of the vectors involved. For Database, input size is the number of
record read and write operations requested. The rational numbers programs exercise
the abstract data type by performing simple arithmetic on rational numbers; their
inputs are measured in number of arithmetic operations requested. Lexer recognizes
words in a character stream; its input size is the number of words in the input.
The “Constraints” column of the table shows the total number of constraints that
the interpreter tried to prove while executing a program. “Schedule Length” is the
number of time steps in the parallel schedule, and “Speed Up” is the net speed up of
the parallel schedule over a purely sequential one. All speed ups are rounded to the
nearest tenth.

Program       Input Size   Constraints   Schedule Length   Speed Up
Vector Abs         1            11              4              2.7
Vector Abs         5            43              4             10.7
Vector Abs        10            83              4             20.7
Vector Abs        20           163              4             40.7
Vector Sum         1            22              6              3.7
Vector Sum         5            70              6             11.7
Vector Sum        10           130              6             21.7
Vector Sum        20           250              6             41.7
Database           6          1045             54             19.4
Database          16          5999            106             56.6
Rationals 1        1           250              9             27.8
Rationals 2        1            56             10              5.6
Lexer              2           438              9             48.7
Lexer              5          6852             12            571.0

Table 2: Parallelization Results
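Under the compactor's unit-time assumption, a purely sequential run takes one step per constraint, so the figures in the “Speed Up” column are consistent with dividing the “Constraints” column by the “Schedule Length” column. The short Python check below illustrates this for three rows copied from Table 2; the function name `speed_up` is hypothetical.

```python
def speed_up(constraints, schedule_length):
    """Sequential time (one step per constraint) over parallel schedule length."""
    return constraints / schedule_length


# (program, input size, constraints, schedule length), copied from Table 2
rows = [
    ("Vector Sum", 20, 250, 6),
    ("Database", 16, 5999, 106),
    ("Lexer", 5, 6852, 12),
]
for program, size, constraints, length in rows:
    print(program, size, round(speed_up(constraints, length), 1))
```

For example, Lexer on five words attempts 6852 proofs that fit into 12 parallel steps, giving 6852 / 12 = 571.0.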

Table  2  demonstrates  several  points.  First,  parallel  speed  ups  are  very  high, 
especially  for  the  larger  inputs.  Similar  speed  ups  have  been  found  in  experiments 
that  simulate  parallelism  in  logic  programming  languages  [20].  The  speed  ups  for 
Vector  Abs  and  Vector  Sum  are  particularly  interesting.  The  ability  to  schedule 
these  programs  into  a  constant  number  of  steps,  regardless  of  input  size,  is  due  to 
local  propagation’s  complete  exploitation  of  data  parallelism  in  them.  These  points 
indicate that CONSUL may be competitive with other declarative languages as a
source of parallelism. Second, all of the constraints in these programs were solved (or
shown to be unsolvable) by local propagation. In fact, of all the CONSUL programs
written, the only one that local propagation clearly failed to execute was deliberately
written to exploit local propagation's weaknesses (see footnote 3). This result bolsters the conjecture
that  constraint  systems  that  cannot  be  solved  by  local  propagation  are  fairly  rare 
in  real-world  programs.  The  interpreter  presently  proves  around  7  constraints  per 
second.  Although  slow  by  production  standards,  this  is  an  acceptable  speed  for  an 
experimental  tool  to  which  no  significant  optimization  efforts  have  been  applied. 

4.3.4  Parallel  Implementation 

The  eventual  goal  of  the  CONSUL  project  is  to  develop  a  compiler  that  can  detect 
implicit parallelism in constraint programs and produce object code that exploits that
parallelism  on  multiprocessors.  Research  has  begun  to  study  the  problems  that  such 
a  compiler  will  have  to  address.  Some  high  level  design  of  the  compiler  has  been  done, 
but  no  code  has  been  written  yet. 

The  compiler  solves  two  problems:  How  to  partition  a  system  of  constraints  into 
processes  that  match  the  granularity  of  the  target  machine,  and  how  to  organize 
communication  between  those  processes.  Analysis  of  communication  patterns  is  done 
first,  by  the  mode  analysis  phase  of  the  compiler.  Mode  analysis  figures  out  which 
variables should be inputs and which outputs in the proofs of each constraint. Partitioning
combines proofs of multiple constraints into a single process.

Mode  analysis  of  CONSUL  programs  will  be  similar  to  mode  analysis  of  logic 
programs  [5].  The  mode  analyzer  will  examine  each  constraint  in  a  program,  deciding 
whether  each  variable  appearing  in  that  constraint  is  an  input  to  the  proof  of  the 
constraint  or  an  output  from  it.  The  mode  analyzer’s  decisions  are  encoded  in  the 
mode  (“input”  or  “output”)  that  it  assigns  to  each  use  of  each  variable.  Programmers 
declare  the  variables  that  will  be  input  to  and  output  from  a  program  as  a  whole. 
Starting  with  this  information,  the  mode  analyzer  infers  the  modes  of  other  variables 
according  to  the  following  rules: 

1.  Variables  declared  to  be  program  inputs  only  have  mode  “input”. 

2.  Other variables have mode “output” in exactly one statement and mode “input”
in all others.

3.  All  mode  assignments  must  be  consistent  with  one  of  the  methods  available  to 
local  propagation  for  solving  the  kind  of  constraint  in  which  they  appear. 

The above rules are consequences of CONSUL's semantics and the use of local
propagation to implement those semantics. For example, Rule 2 follows from
CONSUL being a single-assignment language and Rule 3 from the assumption that
programs will be executed by local propagation. However, these rules do not always
determine a unique mode for every variable in a program. If the mode analyzer ever
encounters a statement for whose variables no mode assignment is legal, it will backtrack
to the last statement for which multiple mode assignments were possible and try
again. It appears that backtracking can be nearly eliminated by adding the following
heuristic (reflecting programmers' tendencies to define values before using them, but
over-ridable by the mode analyzer if it contradicts any of the other rules) to mode
analysis:

1.  Assign the first occurrence of each variable mode “output”.

Footnote 3: There are other programs that the interpreter cannot yet execute, but apparently
because of bugs in the interpreter rather than problems with local propagation.

Although using these rules is generally straightforward, there are a few subtleties
to CONSUL mode analysis. For example, the “or” connective requires Rule 2 to be
modified  to  allow  a  variable  to  have  mode  “output”  once  in  each  arm  of  an  “or”;  initial 
modes  for  the  formal  parameters  to  a  user-defined  relation  are  taken  from  the  modes 
of  the  corresponding  actual  parameters,  so  relation  bodies  will  be  analyzed  several 
times  if  they  are  called  with  multiple  actual  parameter  modes;  etc.  In  pathological 
cases,  mode  analysis  can  take  time  exponential  in  both  the  number  of  statements  in  a 
program  (because  of  the  backtracking)  and  the  number  of  user-defined  relations  in  it 
(because  of  the  multiple  analysis).  In  practice,  mode  analysis  is  expected  to  be  nearly 
linear  in  the  number  of  statements  in  a  program.  The  mode  analysis  algorithm  has 
been  applied  by  hand  to  a  few  CONSUL  programs,  so  far  bearing  out  the  expectation 
of  linear  time  and  finding  essentially  the  same  modes  as  used  by  the  interpreter  in 
actual  execution. 
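The three rules and the first-occurrence heuristic can be sketched as a small backtracking search in Python. This is an illustration only, since no compiler code exists yet: it handles only “sum” statements written as tuples (x, y, z) meaning x = y + z, it assumes that a legal mode assignment for such a statement has either exactly one output variable or none (matching Methods 1 through 4 of Figure 4), and it ignores the “or” connective and user-defined relations. The function name `analyze_modes` is hypothetical.

```python
def analyze_modes(statements, program_inputs):
    """Choose an output variable (or None) for each statement x = y + z.

    Returns a list with one entry per statement, or None if no assignment
    satisfies the rules.
    """
    all_vars = {v for stmt in statements for v in stmt}
    must_define = all_vars - set(program_inputs)        # Rule 1

    def search(i, defined, seen, chosen):
        if i == len(statements):
            # Rule 2: every non-input variable is an output exactly once.
            return chosen if defined == must_define else None
        stmt = statements[i]
        candidates = [v for v in dict.fromkeys(stmt)
                      if v in must_define and v not in defined]
        # Heuristic: try first occurrences as outputs before other choices.
        candidates.sort(key=lambda v: v in seen)
        for out in candidates + [None]:                  # Rule 3; None = all inputs
            new_defined = defined | ({out} if out is not None else set())
            result = search(i + 1, new_defined, seen | set(stmt), chosen + [out])
            if result is not None:
                return result
        return None                                      # backtrack

    return search(0, frozenset(), frozenset(), [])


modes = analyze_modes([("A", "B", "C"), ("D", "A", "B")], program_inputs={"B", "C"})
```

Here `modes` lists the output variable chosen for each statement. A case that needs backtracking is `analyze_modes([("X", "Y", "Z"), ("X", "P", "Q")], {"Z", "P", "Q"})`: the heuristic first makes X the output of the first statement, which leaves Y undefined, so the search backs up and makes Y the output instead.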

A  data  flow  graph,  showing  how  values  will  propagate  between  constraints,  can 
be  built  from  the  results  of  mode  analysis.  Most  of  the  data  flow  graph  is  explicit 
in  the  input/output  relationships  revealed  by  mode  analysis.  Data  flow  edges  across 
iterations  of  a  “forall”  or  between  recursive  invocations  of  user-defined  relations  are 
the  only  ones  not  explicit,  and  they  are  easily  deduced.  A  system  of  processes  could  be 
built  directly  from  the  data  flow  graph,  with  each  node  proved  by  a  separate  process 
and  processes  dynamically  scheduled  for  execution  as  soon  as  all  of  their  inputs  are 
available.  However,  single  constraints  (i.e.,  graph  nodes)  are  too  fine-grained  to  be 
good  processes  on  current  multiprocessors.  The  partitioning  phase  of  the  compiler 
thus combines individual constraints into coarser granularity groups. Although no
partitioning algorithm has been found yet, there are several possible sources. One
is  existing  methods  of  partitioning  programs  in  other  declarative  (mainly  functional) 
languages,  for  example  [11,  12].  Analogs  of  these  methods,  in  which  the  functional 
framework  in  which  they  were  developed  was  replaced  by  a  relational  one,  should  be 
immediately  applicable  to  CONSUL.  Another  source  of  partitioning  heuristics  is  work 
on  partitioning  networks,  for  example  [18].  This  work  appears  ideal  for  minimizing 
communication  between  processes,  but  may  not  address  other  aspects  of  partitioning 
(e.g., load balancing).

4.4  Summary and Conclusions


This  section  has  described  the  current  state  of  the  CONSUL  project.  CONSUL  is  a 
constraint  based  language  for  experimenting  with  implicitly  parallel  programming.  It 
represents  the  first  intensive  study  of  parallelism  in  constraint  languages  other  than 
variants  of  Prolog.  To  date  the  CONSUL  project  has: 

•  Defined  a  constraint  language  that  is  suitable  for  parallel  programming  and 
demonstrated  programs  in  it  for  a  number  of  applications, 

•  Developed  a  formal  definition  of  local  propagation,  a  parallelizable  constraint 
satisfaction  heuristic, 

•  Produced  preliminary  experimental  evidence  suggesting  that  local  propagation 
is  able  to  prove  the  bulk  of  the  constraints  found  in  real  programs,  providing 
considerable  parallelism  in  doing  so,  and 

•  Begun  to  plan  multiprocessor  implementations  of  the  language. 

Some conclusions from this work are:

•  CONSUL  lets  programmers  describe  efficient  algorithms  to  solve  real  problems, 
yet  in  a  language  whose  basic  semantics  does  not  interfere  with  parallelism. 

•  Real  CONSUL  programs  may  exhibit  considerable  parallelism. 

•  Local  propagation  is  an  effective  mechanism  for  executing  these  programs. 

CONSUL is an ongoing project, not a finished product, and this section is a report
on its current status. Thus the above conclusions are still only preliminary. A great
deal  of  work  remains  to  be  done  in  all  areas.  The  results  so  far  certainly  do  not  prove 
that  CONSUL  (or  any  constraint  language)  is  the  perfect  parallel  programming  tool, 
but  they  do  justify  increased  interest  in  parallel  constraint  languages  and  continuing 
efforts  to  test  their  utility. 

5  Parallel  Operating  Systems  and  the  Psyche 
Project 

The  centerpiece  of  the  CER  hardware  grant  (on  which  the  DARPA  research  was 
based)  was  the  purchase  in  1985  of  a  128-node  BBN  Butterfly  Parallel  Processor. 
Over the course of the contract this machine was used to support research in parallel
programming systems, computer vision, massively parallel connectionist models,
and the theory of parallel computation. CER allowed us to acquire and experiment
with several generations of the Butterfly Parallel Processor from BBN-ACI. In particular,
a later-generation Butterfly was obtained for operating systems research and
applications.  Psyche  is  now  the  major  activity  surrounding  the  Butterfly.  Activity 
in  the  Psyche  group  involves  directly  or  indirectly  two  faculty  members  and  four  to 
six  graduate  students.  Psyche  was  running  its  first  jobs  just  when  the  CER  support 
terminated,  and  since  then  it  has  been  expanding  in  usefulness  to  the  user  community. 

One goal was to create a programming environment for MIMD (multiple instruction
stream, multiple data stream) computers. This architecture is complementary
to other styles of parallel computing such as SIMD (in which identical computations
are performed in parallel on different data) and neural nets.

The problem with MIMD computation, which allows multiple large, independent,
cooperating processes to run concurrently on multiple processors, is that the interactions
between programs (for instance, their data accesses) are extremely hard to monitor
and even to repeat, given the potential for race conditions and the scheduling
differences that can occur from run to run. Further, there are several competing,
individually adequate models of parallel programs at this level. For instance,
message-passing models and shared-memory models offer rather different user views
of the computational resource. Although hardware was being built (like the BBN
Butterfly Parallel Processor) to support different models of parallel computation, the
state of the art lacked an operating system that could support
several such models at once.

To improve the state of the art in programming, conceptualizing, monitoring performance,
and optimizing efficiency in MIMD computation, we developed systems like
PSYCHE (an operating system), CONSUL (an intelligent autoparallelizing compiler),
and MOVIOLA (a kit of performance monitoring and debugging tools). We
also produced and exported about a dozen other less ambitious systems and libraries.
The interaction of the MOVIOLA debugging and performance monitoring tools has
proved unexpectedly effective not just in debugging but in algorithm development.


5.1  Early  Work 

At  the  time  our  Butterfly  was  purchased  it  was  not  yet  clear  whether  shared  memory 
would  be  practical  in  large-scale  multiprocessors.  Previous  architectures  had  been 
limited  in  size;  our  Butterfly  and  its  twin  at  BBN  were  for  several  years  the  largest 
shared-memory  multiprocessors  in  the  world,  by  a  large  margin.  Potential  problems 
with memory and interconnect contention, the management of highly parallel shared
data  structures,  and  the  need  to  maximize  locality  of  reference  made  our  purchase  a 
risky  venture.  One  of  the  most  important  results  of  our  research  was  to  show  that 
none  of  these  problems  is  insurmountable.  We  used  the  Butterfly  to  obtain  significant 
speedups (often nearly linear) on over 100 processors with a range of applications that
includes  various  aspects  of  computer  vision  [Brown  et  al.  1986;  Brown  1988b;  Olson 
et  al.  1987;  Olson  1986b, c],  connectionist  network  simulation  [Feldman  et  al.  1988b], 
numerical  algorithms  [LeBlanc  1987,  1988a],  computational  geometry  [Bukys  1986], 
graph  theory  [Costanzo  et  al.  1986],  combinatorial  search  [LeBlanc  et  al.  1988;  Scott 
1989],  lexical  and  syntactic  analysis  [Gafter  1987,  1988],  and  parallel  data  structure 
management  [Mellor-Crummey  1987]. 

We also demonstrated, through our research in parallel programming environments
and tools, that shared-memory machines are flexible enough to support efficient
implementations of a wide range of programming models, with both coarse and
fine-grain parallelism.

From 1984 to 1987, our systems work is best characterized as a period of experimentation,
designed to evaluate the potential of large NUMA (non-uniform memory
access) multiprocessors and to assess the need for software tools. In the course of
this experimentation we ported three compilers to the Butterfly [Scott 1989; Olson
1986a; Crowl 1988b], developed five major and several minor library packages [Crowl
1988b; Low 1986; LeBlanc 1988b; LeBlanc and Jain 1987; Scott and Jones 1988;
Olson 1986; LeBlanc and Mellor-Crummey 1986; Fowler et al. 1989], and built a
parallel file system [Dibble and Scott 1989a, b; Dibble et al. 1988] and two different
operating systems [LeBlanc et al. 1989b; Cox and Fowler 1989]. Our work with the
Lynx  distributed  programming  language  [Scott  1987]  yielded  important  information 
on  the  inherent  costs  of  message  passing  [Scott  and  Cox  1987]  and  the  semantics 
of  the  parallel  language/operating  system  interface  [Scott  1986].  Experience  with 
a  C++  communication  library  yielded  similar  insights  for  object-oriented  systems 
[Crowl  1988b]. 

A  major  focus  of  our  experimentation  with  the  Butterfly  was  the  evaluation  and 
comparison  of  multiple  models  of  parallel  computing  [Brown  et  al.  1986;  LeBlanc 
et  al.  1988;  LeBlanc  1986,  1988a].  BBN  had  already  developed  a  model  based  on 
fine-grain memory sharing [LeBlanc 1986]. In addition, among the programming
environments  listed  above,  we  have  implemented  remote  procedure  calls  [Low  1986]; 
an  object-oriented  encapsulation  of  processes,  memory  blocks,  and  messages  [Crowl 
1988b];  a  message-based  library  package  [LeBlanc  1988b];  a  shared-memory  model 
with  numerous  lightweight  processes  [Scott  and  Jones  1988];  and  a  message-based 
programming  language  [Scott  1989].  In  an  intensive  benchmark  study  conducted  in 
1986  [Brown  et  al.  1986],  we  implemented  seven  different  computer  vision  applications 
on  the  Butterfly  over  the  course  of  a  three-week  period.  Based  on  the  characteristics  of 
the  problems,  programmers  chose  to  use  four  different  programming  models,  provided 
by four of our systems packages. For one of the applications, none of the existing
packages  provided  a  reasonable  fit,  and  the  awkwardness  of  the  resulting  code  was  a 
major  impetus  for  the  development  of  yet  another  package  [Scott  and  Jones  1988]. 

Our principal conclusion from this experimentation was that while every programming
model has applications for which it seems appropriate, no single model
is  appropriate  for  every  application.  Just  as  a  general-purpose  uniprocessor  system 
must  permit  programs  to  be  written  in  a  wide  variety  of  languages  (encompassing 
a  wide  variety  of  models  of  sequential  computation),  we  formed  the  belief  that  a 
general-purpose  multiprocessor  system  must  permit  programs  to  be  written  under  a 
wide  variety  of  parallel  programming  models.  This  conviction  motivated  development 
of  the  Psyche  operating  system. 

5.2  Psyche  Motivation 

As  outlined  above,  our  early  work  led  to  several  conclusions. 

1) Large-scale shared-memory multiprocessors are practical. We achieved significant
speedups (often almost linear) using over 100 processors on a wide range of
applications with many different operating systems, library packages, and languages.
Shared-memory multiprocessors appear to be able to support coarse-grain parallelism
just as efficiently as message-based multicomputers, while simultaneously supporting
very fine-grain interactions. They provide an extremely flexible foundation for
general-purpose  parallel  computing,  and  for  high-level  vision  in  particular. 

2)  Programmers  need  multiple  models  of  parallel  computation.  Though  many  styles 
of communication and process structure can be implemented efficiently on a shared-memory
machine, no single model can provide optimal performance for all applications.
Moreover, subjective experience indicates that conceptual clarity and ease of
programming are maximized by different models for different kinds of applications.
In the course of our DARPA benchmark experiments [Brown et al. 1986], seven different
problems were implemented using four different programming models. One of
the  basic  conclusions  of  the  study  was  that  none  of  the  models  then  available  was 
appropriate  for  certain  graph  problems;  this  experience  led  to  the  development  of  the 
Ant  Farm  library  package  [Scott  and  Jones  1988].  Large  embedded  applications  (such 
as  vision)  may  well  require  different  programming  models  for  different  components;  it 
therefore  seemed  important  to  be  able  to  communicate  across  programming  models 
as  well. 

3)  An  efficient  implementation  of  a  shared  name  space  is  valuable  even  in  the 
absence  of  uniform  access  time.  We  found  one  of  the  primary  advantages  of  shared 
memory  to  be  its  familiar  computational  model.  A  uniform  addressing  environment 
allows  programs  to  pass  pointers  and  data  structures  containing  pointers  without 
explicit  translation.  This  uniformity  of  naming  appears  to  be  the  primary  reason 
why programmers choose to use BBN’s Uniform System package. Even when non-uniform
access times force the programmer to deal explicitly with local caching,
shared  memory  continues  to  provide  a  form  of  global  name  space  that  supports  easy 
copying  of  data  from  one  location  to  another. 

4) Dynamic fine-grain sharing is important for many applications. It is often
difficult  to  specify  at  creation  time  which  data  objects  will  be  shared  and  which 
private,  which  local  and  which  global,  which  long-lived  and  which  temporary.  It  can 
be  particularly  difficult  to  specify  which  processes  will  need  access  to  specific  pieces 
of  data,  and  wasteful  to  require  processes  to  demonstrate  access  rights  for  data  they 
may  never  use.  Far  preferable  is  a  scheme  in  which  all  objects  are  potentially  sharable 
and  treated  uniformly,  with  access  control  and  other  bookkeeping  performed  as  late 
as  possible.  Such  a  scheme  provides  the  user  with  greater  latitude  in  program  design, 
minimizes  resource  usage,  and  facilitates  migration  to  maximize  locality  and  balance 
workloads. 

5)  Maximum  performance  and  flexibility  depend  on  a  low-level  kernel  interface. 
From the point of view of an individual application, the ideal operating system probably
lies at one of two extremes: it either provides every facility the application needs,
or else provides a flexible and efficient set of primitives from which those facilities can
be built. A kernel that lies in between is likely to be both awkward and slow: awkward
because it has sacrificed the flexibility of the more primitive system, slow because it
has sacrificed its simplicity. Moreover, a kernel with a high-level interface is unlikely to
be able to provide facilities acceptable to every application. Low-level primitives can
be much more universal. They imply the need for friendly software packages that run
on top of the kernel and under user programs, but with a carefully designed interface
these can be as efficient as kernel-level code and much less difficult to change.

6) A high-quality programming environment is essential. Some application programmers
in our department who could have exploited the parallelism offered by
the Butterfly continued to use Sun workstations and VAXen. These programmers
weighed the potential speedup of the Butterfly against the programming environment
of their workstation and found the Butterfly wanting. Of particular importance
are tools for parallel debugging. Our work with Instant Replay [LeBlanc and Mellor-Crummey
1987; Fowler et al. 1988] clearly provided an important step in the right
direction.  A  high-quality  environment  for  performance  monitoring,  called  Moviola, 
was  also  created. 


5.3  Psyche 

Preliminary  ideas  for  Psyche  date  to  1986.  Design  work  began  in  earnest  in  1987 
and  was  essentially  completed  by  the  summer  of  1988,  when  implementation  began 
on the BBN Butterfly Plus multiprocessor. Early plans for Psyche were summarized
in  a  1987  technical  report  [Scott  and  LeBlanc  1987].  Rationale  for  the  design  was 
presented  in  1988  [Scott  and  Marsh  1988].  Technical  reports  on  the  user/kernel 
interface  [Scott  et  al.  1989a]  and  the  memory  management  system  [LeBlanc  et  al. 
1989a]  appeared  in  1989,  and  were  followed  by  workshop  papers  on  open-systems 
design and the kernel implementation [Scott et al. 1989b, c]. A detailed discussion of
multi-model  programming  appeared  at  the  1989  PPoPP  conference. 

The design of Psyche is based on the observation that access to shared memory
is the fundamental mechanism for interaction between threads of control on a multiprocessor.
Any other abstraction that can be provided on the machine must be built
from  this  basic  mechanism.  An  operating  system  whose  kernel  interface  is  based  on 
direct  use  of  shared  memory  will  thus  in  some  sense  be  universal. 

The  realm  is  the  central  abstraction  provided  by  the  Psyche  kernel.  Each  realm 
includes  data  and  code.  The  code  constitutes  a  protocol  for  manipulating  the  data  and 
for  scheduling  threads  of  control.  The  intent  is  that  the  data  should  not  be  accessed 
except  by  obeying  the  protocol.  In  effect,  a  realm  is  an  abstract  data  object.  Its 
protocol  consists  of  operations  on  the  data  that  define  the  nature  of  the  abstraction. 
Invocation  of  these  operations  is  the  principal  mechanism  for  communication  between 
parallel  threads  of  control. 
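As a rough illustration of the realm idea (data bundled with the operations that define how it may be used), the following Python sketch models a realm as an object whose data is reached only through named protocol operations. The class `Realm`, its `invoke` method, and the stack example are hypothetical names chosen for this sketch and do not reflect the Psyche kernel interface.

```python
class Realm:
    """Data plus the protocol (operations) through which the data is meant to be used."""

    def __init__(self, data, protocol):
        self._data = data                 # state encapsulated by the realm
        self._protocol = dict(protocol)   # operation name -> function(data, *args)

    def invoke(self, operation, *args):
        """Run one protocol operation against the realm's data."""
        if operation not in self._protocol:
            raise KeyError(f"operation {operation!r} is not part of this realm's protocol")
        return self._protocol[operation](self._data, *args)


def _push(items, value):
    items.append(value)

def _pop(items):
    return items.pop()

# A hypothetical stack realm: a list as data, push/pop as its protocol.
stack = Realm([], {"push": _push, "pop": _pop})
stack.invoke("push", 1)
stack.invoke("push", 2)
top = stack.invoke("pop")
```

Nothing in Python prevents a caller from reaching `_data` directly, which loosely mirrors the statement that the protocol is a convention unless protection is requested.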

The  thread  is  the  abstraction  for  control  flow  and  scheduling.  All  threads  that 
begin  execution  in  the  same  realm  reside  in  a  single  protection  domain.  That 
domain  enjoys  access  to  the  original  realm  and  any  other  realms  for  which  access 
rights  have  been  demonstrated  to  the  kernel.  Part  of  the  layout  of  a  thread  context 
block  is  defined  by  the  kernel,  but  threads  themselves  are  created  and  scheduled  by 
the  user.  The  kernel  time-slices  on  each  processor  between  protection  domains  in 
which  threads  are  active,  providing  upcalls  at  quantum  boundaries  and  whenever 
else  a  scheduling  decision  is  required.  Context  switches  between  threads  in  the  same 
protection  domain  do  not  require  kernel  intervention.  In  addition,  a  standardized 
interface  to  scheduling  routines  allows  threads  of  different  types  to  block  and  unblock 
each  other. 
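The idea that threads are created and scheduled by user code, with a common block/unblock interface, can be sketched with Python generators standing in for threads. Everything here (`UserScheduler`, the `"block"` request string, the producer and consumer) is an assumption made for illustration. Kernel time-slicing and upcalls are not modeled.

```python
from collections import deque

class UserScheduler:
    """Round-robin scheduler for generator-based threads within one protection domain."""

    def __init__(self):
        self.ready = deque()
        self.blocked = {}

    def spawn(self, name, gen):
        self.ready.append((name, gen))

    def unblock(self, name):
        self.ready.append((name, self.blocked.pop(name)))

    def run(self):
        while self.ready:
            name, gen = self.ready.popleft()
            try:
                request = next(gen)       # run the thread until it yields
            except StopIteration:
                continue                  # thread finished
            if request == "block":
                self.blocked[name] = gen  # parked until someone calls unblock
            else:
                self.ready.append((name, gen))


log = []
sched = UserScheduler()

def consumer():
    log.append("consumer waits")
    yield "block"
    log.append("consumer resumes")

def producer():
    log.append("producer works")
    yield "yield"
    sched.unblock("consumer")
    log.append("producer done")

sched.spawn("consumer", consumer())
sched.spawn("producer", producer())
sched.run()
```

Each switch between the two threads is an ordinary function return inside `run`, which corresponds loosely to context switches that need no kernel intervention.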

The relationship between realms and threads is somewhat unusual: the conventional
notion of an anthropomorphic process has no analog in Psyche. Realms are
passive  objects,  but  their  code  controls  all  execution.  Threads  merely  animate  the 
code;  they  have  no  “volition”  of  their  own. 

Depending  on  the  degree  of  protection  desired,  an  invocation  of  a  realm  operation 
can  be  as  fast  as  an  ordinary  procedure  call  or  as  slow  as  a  heavyweight  process 
switch.  We  call  the  inexpensive  version  an  optimized  invocation;  the  safer  version  is 
a  protected  invocation.  In  the  case  of  a  trivial  protocol  or  truly  minimal  protection, 
Psyche  also  permits  direct  external  access  to  the  data  of  a  realm.  One  can  think  of 
direct  access  as  a  mechanism  for  in-line  expansion  of  realm  operations.  By  mixing 
the  use  of  protected,  optimized,  and  in-line  invocations,  the  programmer  can  obtain 
(and  pay  for)  as  much  or  as  little  protection  as  desired. 

Keys  and  access  lists  are  the  mechanisms  used  to  implement  protection.  Each 
realm  includes  an  access  list  consisting  of  <key,  right>  pairs.  Each  thread  maintains 
a  list  of  keys.  The  right  to  invoke  an  operation  of  a  realm  is  conferred  by  possession 
of  a  key  for  which  appropriate  permissions  appear  in  the  realm’s  access  list.  A  key 
is  a  large  uninterpreted  value  affording  probabilistic  protection.  The  creation  and 
distribution  of  keys  and  the  management  of  access  lists  are  all  under  user  control, 
enabling the implementation of many different protection policies.
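A minimal sketch of the key and access-list check follows. It assumes an access list can be modeled as a mapping from key to a set of rights, and that a key is a 128-bit random number; the key width, the right names, and the helper functions are assumptions for illustration only.

```python
import secrets

def new_key():
    """A large random value; protection is probabilistic because keys are hard to guess."""
    return secrets.randbits(128)

def has_right(key_list, access_list, right):
    """True if any key held by the thread appears in the access list with the given right."""
    return any(right in access_list.get(key, set()) for key in key_list)

owner_key, reader_key = new_key(), new_key()
# The realm's access list: <key, right> pairs, grouped here by key.
access_list = {owner_key: {"read", "write"}, reader_key: {"read"}}
```

Because keys are plain values, user code can copy and hand them out freely, which is what leaves the protection policy under user control.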

If  optimized  (particularly  in-line)  invocations  are  to  proceed  quickly,  they  must 
avoid  modification  of  memory  maps.  Every  realm  visible  to  a  given  thread  must 
therefore  occupy  a  different  location  from  the  point  of  view  of  that  thread.  In  addition, 
if  pointers  are  to  be  stored  in  realms,  then  every  realm  visible  to  multiple  threads 
must  occupy  the  same  location  from  the  point  of  view  of  each  of  those  threads.  In 
order  to  satisfy  these  two  requirements,  Psyche  arranges  for  all  coexistent  sharable 
realms  to  occupy  disjoint  locations  in  a  single,  global,  virtual  address  space.  Each 
protection  domain  may  have  a  different  view  of  this  address  space,  in  the  sense  that 
different  subsets  may  be  marked  accessible,  but  the  virtual  to  physical  mapping  does 
not  change. 
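The two requirements above (distinct realms at distinct addresses, and each realm at the same address for every thread that sees it) can be illustrated with a toy allocator. The classes `GlobalAddressSpace` and `View`, the base address, and the sizes are hypothetical; real page-level mapping is not modeled.

```python
class GlobalAddressSpace:
    """Hands out disjoint address ranges so every realm has one fixed location."""

    def __init__(self, base=0x1000):
        self._next = base
        self.ranges = {}                  # realm name -> (start, end), end exclusive

    def place(self, realm, size):
        start = self._next
        self._next += size
        self.ranges[realm] = (start, self._next)
        return start


class View:
    """Per-domain subset of the shared address space that is marked accessible."""

    def __init__(self, space):
        self.space = space
        self.accessible = set()           # names of realms this domain may touch

    def can_access(self, address):
        return any(self.space.ranges[r][0] <= address < self.space.ranges[r][1]
                   for r in self.accessible)


space = GlobalAddressSpace()
a = space.place("A", 0x100)
b = space.place("B", 0x200)
d1, d2 = View(space), View(space)
d1.accessible.add("A")
d2.accessible.update({"A", "B"})
```

Both views resolve realm A at the same address, so a pointer into A stored by one domain remains meaningful to the other; only the accessibility marks differ.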

The  view  of  a  protection  domain  is  embodied  in  the  hardware  memory  map. 
Execution  proceeds  unimpeded  until  an  attempt  is  made  to  access  something  not 
included  in  the  view.  The  resulting  protection  fault  is  fielded  by  the  kernel,  whose 
job  it  is  to  either  (1)  announce  an  error,  (2)  update  the  current  view  and  restart  the 
faulting  instruction,  or  (3)  perform  an  upcall  into  the  protection  domain  associated 
with  the  target  realm,  in  order  to  create  a  new  thread  to  perform  the  attempted 
operation.  In  effect,  Psyche  uses  conventional  memory-management  hardware  as  a 
cache  for  software-managed  protection.  Case  (2)  corresponds  to  optimized  invocation. 
Future invocations of the same realm from the same protection domain will proceed
without  kernel  intervention.  Case  (3)  corresponds  to  protected  invocation.  The  choice 
between  cases  is  made  by  matching  the  key  list  of  the  current  thread  against  the  access 
list  of  the  target  realm. 
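The three-way decision made on a protection fault can be sketched as follows. The view is modeled as a set of realm names acting as the cache, and the right names `"optimized"` and `"protected"` are assumptions introduced for this sketch, not Psyche's actual encoding.

```python
def handle_fault(view, key_list, realm_name, access_list):
    """Decide what a simulated kernel does when a domain touches a realm outside its view.

    view is the set of realm names currently accessible to the domain (the "cache").
    access_list maps key -> set of rights; the right names used here are assumptions.
    """
    rights = set()
    for key in key_list:
        rights |= access_list.get(key, set())
    if "optimized" in rights:
        view.add(realm_name)          # case 2: open the realm; later accesses do not fault
        return "restart"
    if "protected" in rights:
        return "upcall"               # case 3: run the operation in the target's domain
    return "error"                    # case 1: no matching key


def access(view, key_list, realm_name, access_list):
    if realm_name in view:
        return "direct"               # already in the view: no kernel involvement
    return handle_fault(view, key_list, realm_name, access_list)


acl = {"k1": {"optimized"}, "k2": {"protected"}}
view = set()
first = access(view, ["k1"], "R", acl)
second = access(view, ["k1"], "R", acl)
```

The second access returns without consulting the access list at all, which is the sense in which rights are evaluated lazily and then cached.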

For  both  locality  and  communication,  the  philosophy  of  Psyche  is  to  provide  a 
fundamental,  low-level  mechanism  from  which  a  wide  variety  of  higher-level  facilities 
can  be  built.  Realms  with  appropriate  protocol  operations  can  be  used  to  implement 
the  following: 

1.  Pure  shared  memory  in  the  style  of  the  BBN  Uniform  System  [Thomas  1988]. 
A  single  large  collection  of  realms  would  be  shared  by  all  threads.  The  access 
protocol,  in  an  abstract  sense,  would  permit  unrestricted  reads  and  writes  of 
individual  memory  cells. 

2.  Packet-switched  message  passing.  Each  message  would  be  a  separate  realm. 
To  send  a  message  one  would  make  the  realm  accessible  to  the  receiver  and 
inaccessible  to  the  sender. 

3.  Circuit-switched  message  passing,  in  the  style  of  Accent  [Rashid  and  Robertson 
1981] or Lynx [Scott 1987]. Each communication channel would be realized as
a  realm  accessible  to  a  limited  number  of  threads  and  would  contain  buffers 
manipulated  by  protocol  operations. 

4. Synchronization mechanisms such as monitors, locks, and path expressions.
Each  of  these  can  be  written  once  as  a  library  routine  that  is  instantiated  as  a 
realm  by  each  abstraction  that  needs  it. 

5.  Parallel  data  structures.  Special-purpose  locking  could  be  implemented  in  a 
collection  of  realms  scattered  across  the  nodes  of  the  machine,  in  order  to  reduce 
contention.  The  entry  routines  of  the  data  structure  as  a  whole  might  be  fully 
parallel,  able  to  be  executed  without  synchronization  until  access  is  required  to 
particular  pieces  of  the  data. 
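As one concrete case from the list above, packet-switched message passing (item 2) can be sketched by treating each message as an object whose accessibility is transferred from sender to receiver. The class `MessageRealm`, the function `send`, and the party names are hypothetical and serve only to illustrate the idea.

```python
class MessageRealm:
    """A message modeled as a realm that is accessible to one party at a time."""

    def __init__(self, payload, owner):
        self._payload = payload
        self.accessible_to = {owner}

    def read(self, who):
        if who not in self.accessible_to:
            raise PermissionError(f"{who} cannot access this message")
        return self._payload


def send(message, sender, receiver):
    """Transfer the message: revoke the sender's access, then grant it to the receiver."""
    if sender not in message.accessible_to:
        raise PermissionError(f"{sender} does not hold this message")
    message.accessible_to.discard(sender)
    message.accessible_to.add(receiver)


msg = MessageRealm("hello", "A")
send(msg, "A", "B")
received = msg.read("B")
```

No data is copied in this sketch; only the access marks change, which is the property that makes the scheme attractive on a shared-memory machine.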

Psyche  provides  a  low-level  interface  with  uniform  naming  and  an  emphasis  on 
dynamic  fine-grained  sharing.  Through  its  use  of  data  abstraction,  lazy  evaluation  of 
protection, and parameterized user-level scheduling, it allows programs written under
many different programming models to coexist and interact. The conventions of
realm protocols, upcalls, and block and unblock routines provide a structure for communication
across models that is, to the best of our knowledge, unprecedented. With
appropriate permissions, user-level code can exercise full control over the physical
resources of memory, processors, and devices. In effect, it should be possible under
Psyche to implement almost any application for which the underlying hardware
is  appropriate.  This,  for  us,  constitutes  the  definition  of  “general-purpose  parallel 
computing.” 

Psyche differs from existing multiprocessor operating systems in several fundamental
ways.

1.  It  employs  a  uniform  name  (address)  space  for  all  its  user  programs  without 
relying  on  compiler  support  for  protection. 

2.  It  evaluates  access  rights  lazily,  permitting  the  distribution  of  rights  without 
kernel  intervention. 

3.  It  places  the  management  of  threads,  and  in  fact  their  definition,  in  the  hands 
of  user-level  code. 

4.  It  minimizes  the  need  for  kernel  calls  in  general  by  relying  whenever  possible 
on  shared  user/kernel  data  structures  that  can  be  examined  asynchronously. 

5. It provides the user with an explicit tradeoff between protection and performance
by facilitating the interchange of protected and optimized invocations.

The  kernel  provides  the  foundation  for  a  wide  variety  of  future  work  in  parallel 
systems  as  well  as  for  applications  (including  real-time  artificial  intelligence).  It  is 
conceived  as  a  lowest  common  denominator  for  a  multiprocessor  operating  system, 
providing  only  those  functions  necessary  to  access  physical  resources  and  implement 
protection in higher layers. The three fundamental kernel abstractions are the segment,
the address space, and the thread of control. All three are protected through
capabilities. Unusual features include an inter-address-space communication mechanism
based on explicit transfer of control between threads and a facility for reflecting
memory  protection  violations  upwards  into  user-space  fault  handlers. 

As of November 1989 we were able to run our first real user applications. Implemented
portions of Psyche include:

• Low-level machine support: interrupt handlers, virtual memory (without paging),
full support for inter-kernel shared memory, synchronous inter-kernel communication
via remote interrupts, support for atomic hardware operations, remote
source-level kernel debugging, and loading of the kernel via Ethernet.

• Core support for the Psyche user interface: realms, virtual processors, protection
domains, keys and access lists, software interrupts, and protected and
optimized invocation of realm operations.

• Rudimentary I/O to the console serial device and remote file service via Ethernet.

• Minimal user-level tools: a simple shell, program loader and name server, support
for command-line argument passing, simple handlers for software interrupts,
and standard I/O and kernel call libraries.

We  expect  our  work  on  Psyche  to  evolve  into  many  interrelated  projects.  We  are 
already experimenting with novel and promising approaches to memory management,
inter-node communication within the kernel, and support for remote debugging. We
are  working  to  develop  practical  techniques  to  maximize  locality  of  reference  through 
automatic  code  and  data  migration.  We  expect  our  future  efforts  to  include  work 
on  lightweight  process  structure,  implementation  and  evaluation  of  communication 
models,  and  parallel  language  design.  The  latter  subject  is  of  particular  interest.  We 
have  specifically  avoided  language  dependencies  in  the  design  of  the  Psyche  kernel.  It 
is  our  intent  that  many  languages,  with  widely  differing  process  and  communication 
models,  be  able  to  coexist  and  cooperate  on  a  Psyche  machine.  We  are  interested, 
however,  in  the  extent  to  which  the  Psyche  philosophy  itself  can  be  embodied  in  a 
programming  language. 

The  communications  facilities  of  a  language  enjoy  considerable  advantages  over  a 
simple  subroutine  library.  They  can  be  integrated  with  the  naming  and  type  structure 
of  the  language.  They  can  employ  alternative  syntax.  They  can  make  use  of  implicit 
context. They can produce language-level exceptions. For us the question is: to what
extent can these advantages be provided without insisting on a single communication
model at language-design time? We expect this question to form the basis of future
work.

The Psyche design was motivated and continues to be driven by the needs of
application programs, primarily AI applications. Our experiences in the development
of  individual  vision  programs  on  the  Butterfly  provided  the  lessons  upon  which  the 
Psyche  design  was  based.  We  successfully  used  the  active  vision  and  robotics  project 
as  a  vehicle  for  evaluating  the  Psyche  design  and  implementation. 

Our  laboratory  for  active  vision  and  robotics  assumes  a  hardware  configuration  in 
which  camera  output  is  fed  into  a  pipelined  image  processor  and  the  general-purpose 
multiprocessor  is  reserved  for  higher-level  planning  and  control.  Initially,  most  of 
these  higher-level  functions  were  performed  on  a  uniprocessor  Sun.  As  the  Psyche 
implementation  became  available,  some  of  these  functions  were  migrated  onto  the 
Butterfly. By making this migration an explicit part of the development process
we  permitted  early  work  in  the  systems  and  application  domains  to  proceed  in  a 
semi-decoupled  fashion,  with  neither  on  the  other’s  critical  path.  The  success  of 
our previous efforts in operating system implementation for the Butterfly [Mellor-Crummey
et al. 1987], together with the fact that Psyche construction is now well
underway,  suggests  that  the  availability  of  the  operating  system  is  unlikely  to  be  a 
problem  in  later  phases  of  the  project. 

Research  in  this  direction  is  continuing,  with  further  hardware  support  provided 
by an NSF IIP grant. Once software has moved to the Butterfly, we expect our higher-level
functions to involve hundreds of parallel threads of control. Some of these threads
will  share  data  structures.  Others  will  interact  through  message  passing.  Some  will 
confine  their  activities  to  the  multiprocessor.  Others  will  interface  to  the  image 
processor  and  the  camera  and  robot  controls.  Those  that  share  data  are  likely  to 
differ  in  their  needs  for  synchronization  and  consistency. 


6  Other  Programming  Libraries  and  Utilities  for 
MIMD  Parallelism 

Several low-level communications utilities were written to support the interaction of
parallel image processing with action. Communication between the embedded controller
in the robot arm and the controlling software on the host is via a 9600 baud serial
line. On top of the serial line is layered a reliable data link protocol, implemented
under Unix as a tty line discipline and in the robot controller as part of the VAL
execution monitor. Above the data link layer is a protocol supporting multiple logical
channels between the robot and the host. The data link software was developed
and distributed by the Electrical Engineering Department at Purdue University. The
logical channel software (BOTLIB) was inspired by an analogous interface developed
at Purdue, but has been completely re-engineered at Rochester to provide more
flexibility and speed. It provides routines to get the current robot location in terms of
standard coordinates or joint angles, move the robot to a specified location in terms
of standard coordinates or joint angles, set the speed of the robot, and set the
location and orientation of the tool tip. The software is organized as a C language
library whose routines can be called from the application program.
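
The layering described above, with several logical channels carried over one reliable
data link, can be sketched as follows. This is only an illustration: the frame layout
(one channel byte, one length byte, then the payload) and the names Frame, encodeFrame,
and demultiplex are assumptions made for this sketch, not the BOTLIB or Purdue wire
format, which this report does not describe.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical frame: [channel id][payload length][payload bytes...].
// The real BOTLIB wire format is not given in this report.
struct Frame {
    std::uint8_t channel;
    std::string payload;  // at most 255 bytes in this sketch
};

// Append one frame to the byte stream handed to the data link layer.
void encodeFrame(const Frame& f, std::vector<std::uint8_t>& stream) {
    stream.push_back(f.channel);
    stream.push_back(static_cast<std::uint8_t>(f.payload.size()));
    stream.insert(stream.end(), f.payload.begin(), f.payload.end());
}

// Split a received byte stream back into per-channel message queues.
// Stops at the first incomplete frame; a real receiver would wait for more bytes.
std::map<std::uint8_t, std::vector<std::string>>
demultiplex(const std::vector<std::uint8_t>& stream) {
    std::map<std::uint8_t, std::vector<std::string>> channels;
    std::size_t pos = 0;
    while (pos + 2 <= stream.size()) {
        std::uint8_t channel = stream[pos];
        std::size_t length = stream[pos + 1];
        if (pos + 2 + length > stream.size()) break;  // truncated frame
        channels[channel].emplace_back(stream.begin() + pos + 2,
                                       stream.begin() + pos + 2 + length);
        pos += 2 + length;
    }
    return channels;
}
```

Because each frame names its channel, requests of different kinds (for example motion
commands and location queries) can be interleaved on the single serial line and sorted
out again at the other end.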

An  alternate  C  library  (ROBOCOMM)  was  written  by  Brian  Yamauchi  for  use 
in  the  Juggler  project  (see  below).  ROBOCOMM  is  much  faster  than  the  BOTLIB 
package  since  it  does  not  use  the  multi-layered,  reliable  ISO-standard  structure  for 
communication. 

Work  in  these  areas  is  continuing  past  the  contract  period.  Connection  between 
the  Butterfly  serial  ports  and  the  robot  is  being  explored  by  Mark  Crovella,  who  is 
adding  Psyche  capabilities  to  manage  such  communications.  When  complete,  this 
facility  will  give  individual  Butterfly  processors  the  ability  to  communicate  directly 
with  the  robot. 

In addition to CONSUL [Baldwin 1989a, b], Rochester developed several compilers,
program libraries, systems utilities for communication, and file systems. The results
at the end of the contract period span a broad range from parallel file systems through
new languages for expressing parallel computation. Applications packages such as the
current version of the neural net simulator [Fanty 1986, 1988; Goddard et al. 1989]
and the image-processing utilities produced throughout the contract period allow
speedups of up to a factor of 100 over single-workstation implementations [Olson et al.
1987; Olson 1986b, c]. User interfaces to large multiprocessor computers are a difficult
issue, but we have contributed to that as well [Scott and Yap 1988; Yap and Scott
1990; Olson 1986a], and we are still working to extend the range of computational
models available to a user. For instance, the Ant Farm project provides the basic
capability to support many lightweight processes.

“An Empirical Study of Message-Passing Overhead,” by M. L. Scott and A. L.
Cox, appeared at the 7th International Conference on Distributed Computing Systems
in Berlin, West Germany in September 1987. It reports on efforts to optimize the
performance of the LYNX run-time support package and presents a detailed breakdown
of costs in the final implementation. This breakdown (1) reveals the marginal
cost of various features of LYNX, (2) carries important implications for the costs of
related features in other languages, and (3) sets an example for similar studies in other
environments. Other work in this important effort of quantifying parallel behavior is
documented in [Floyd 1989; LeBlanc et al. 1988; LeBlanc 1988a, 1988b; Scott
and Cox 1987].

The “Ant Farm” library package was used to develop applications [Scott and Jones
1989]. It supports extremely large numbers (c. 25,000) of lightweight processes in
Modula-2 with location-transparent communication.

We constructed and studied the performance of a novel operating system for the
Butterfly, called Elmwood. “Elmwood: An Object-Oriented Multiprocessor Operating
System” appeared in Software: Practice and Experience [Mellor-Crummey et al. 1987;
LeBlanc et al. 1989].

“Crowd Control: Coordinating Processes in Parallel” by T. J. LeBlanc and S. Jain
appeared in the Proc. of the International Conference on Parallel Processing. This
paper describes a library package for the Butterfly that can be used to create a parallel
schedule for large numbers of processes. A partial order is imposed on the execution
based on an arbitrary embedding of processes in a balanced binary tree [LeBlanc and
Jain 1987].
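
A minimal sketch of the tree idea follows. It assumes one particular embedding (heap
order, where process i has children 2i+1 and 2i+2) and assumes that a process may
start only after its parent; both choices, and the names childrenOf and startStep, are
illustrative and are not taken from the Crowd Control package itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical embedding: process i has children 2i+1 and 2i+2 (heap order).
// Any other embedding in a balanced binary tree would serve equally well.
std::vector<std::size_t> childrenOf(std::size_t i, std::size_t n) {
    std::vector<std::size_t> kids;
    for (std::size_t c = 2 * i + 1; c <= 2 * i + 2 && c < n; ++c) kids.push_back(c);
    return kids;
}

// Earliest step at which each process can start when every parent must start
// before its children and siblings may start in parallel.
std::vector<std::size_t> startStep(std::size_t n) {
    std::vector<std::size_t> step(n, 0);
    // A parent's index is always smaller than its children's, so step[i]
    // is final by the time it is read here.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t c : childrenOf(i, n)) step[c] = step[i] + 1;
    return step;
}
```

Under these assumptions the last of n processes starts after roughly log2(n) steps,
rather than after n steps as it would if a single process started all the others in
sequence.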

Other utilities developed over the contract period include the Bridge file system
for parallel I/O, by Peter Dibble [Dibble et al. 1988; Dibble and Scott 1989a, b], the
Platinum and Osmium systems for automatically resolving caching and non-uniform
reference problems in SIMD-like computations [Fowler and Cox 1988a, b; Cox and
Fowler 1989], and many other pieces of work cited in the references [Olson 1986a;
Mellor-Crummey 1987; Gafter 1987, 1988; Bolosky 1989].

Characteristics of several programming utilities are compared in Table 3, which
also includes some well-known, commercially available programming systems for NUMA
MIMD computers such as the Butterfly (Uniform System, Emerald, Linda). This
extensive experience in implementing and analyzing the performance of parallel
programming models has naturally led to the ideas behind the Psyche system [Scott and
LeBlanc 1987; Scott et al. 1988, 1989a, b, c, 1990].


7  Programming  Environments  for  MIMD  Parallelism

A major portion of the work under the DARPA contract concentrated on problems of
monitoring and debugging programs for the parallel vision environment. Rochester
developed many tools to help the user effectively implement parallel algorithms [e.g.
LeBlanc 1989; LeBlanc et al. 1990; Mellor-Crummey 1988, 1989]. The main thrust
has been the construction of parallel performance monitoring tools and experimentation
with the use of these tools [e.g. Fowler and Bella 1989; Fowler et al. 1989].

One of the most serious problems in the development cycle of large-scale parallel
programs is the lack of tools for debugging and performance analysis. Three issues
complicate parallel program analysis. First, parallel programs can exhibit nonrepeatable
behavior, limiting the effectiveness of traditional cyclic debugging techniques.
Second, interactive analysis, frequently employed for sequential programs, can distort
a parallel program’s execution behavior beyond recognition. Third, comprehensive
analysis of a parallel program’s execution requires collection, management, and
presentation of an enormous amount of data. Our work addressed all of these problems.

| Package | processes | scheduling | communication | synchronization | protection |
|---|---|---|---|---|---|
| Uniform System | procedure weight | concurrent; run to completion | shared memory | spin locks, atomic queues | none |
| Lynx | one per address space; multithreaded | processes concurrent; threads run until blocked | RPC | implicit in communication | between processes |
| SMP | one per address space | concurrent, preemptible | non-blocking messages | implicit in communication | between processes |
| Chrysalis++ | one per address space | concurrent, preemptible | shared memory, messages | events, atomic queues | between processes |
| Ant Farm | coroutine weight, statically located | run until blocked within a processor | shared memory | events, monitors, queues, semaphores | none |
| MultiLisp | coroutine weight | concurrent, preemptible | shared memory | monitors; implicit in expression evaluation | none |
| Platinum | multiple per address space; kernel managed | concurrent, preemptible | shared memory, messages | spin locks; implicit in communication | between address spaces |
| Elmwood | multiple per address space; kernel managed | concurrent, preemptible; move between objects | object invocation; shared memory within objects | implicit in invocation; semaphores and conditions within objects | between address spaces (objects) |
| Emerald | coroutine weight | concurrent, preemptible; move between objects | object invocation; shared memory within objects | implicit in invocation; monitors within objects | between objects (compiler enforced) |
| Linda | unspecified | concurrent, preemptible | shared associative store | implicit in store accesses | unspecified; often provided |

Table 3: Programming Systems (six developed at Rochester)
for NUMA MIMD computers.

Our work has been different from other research in parallel program analysis in
two key respects. First, our focus was on large-scale, shared-memory multiprocessors.
Second, our approach integrated debugging and performance analysis, using a
common representation of program executions.

The  core  of  our  toolkit  consists  of  facilities  for  recording  execution  histories,  a 
common  user  interface  for  the  interactive,  graphical  manipulation  of  those  histories, 
and  tools  for  examining  and  manipulating  program  state  during  replay  of  a  previously 
recorded  execution.  These  facilities  form  a  foundation  upon  which  we  can  construct 
more  complex  tools  such  as  symbolic  debuggers,  execution  profilers,  and  performance 
analyzers. 

We  have  constructed  a  set  of  tools  for  instrumenting  parallel  programs  on  the 
Butterfly  for  performance  analysis.  Each  process  in  an  instrumented  program  records 
on  its  own  “history  tape”  each  of  its  interactions  with  shared  objects  including  the 
relative  timing  of  the  operations. 

An  execution  history  is  represented  naturally  as  a  directed  acyclic  graph  (DAG) 
of  process  interactions.  Nodes  in  the  graph  correspond  to  monitored  events  that 
took  place  during  execution.  Each  event  represents  an  operation  on  a  shared  object. 
Events  within  a  process  are  linked  by  arcs  denoting  a  temporal  relation  based  on 
a  local  time  scale.  Arcs  between  events  in  different  processes  denote  interprocess 
communication  and  synchronization. 

The  collection  of  history  tapes  from  the  individual  processes  can  be  combined  to 
give  a  consistent  view  of  the  execution  of  the  program  as  a  whole.  This  view  contains 
information  useful  for  identifying  critical  paths,  bottlenecks,  and  hot  spots  in  the 
program. 
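
As an illustration of how a combined history supports critical path identification, the
following sketch computes the length of the longest dependency chain through a small
execution DAG. The Event structure, its field names, and the requirement that events
be stored in topological order are assumptions made for this sketch; they are not the
toolkit's actual data format.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One monitored event: an operation on a shared object.
struct Event {
    int process;                     // process that performed the operation
    double duration;                 // time spent in the operation
    std::vector<std::size_t> preds;  // indices of events that must precede it
};

// Length of the longest (critical) path through the history.
// Assumes events are stored in topological order, i.e. every predecessor
// index is smaller than the index of the event that depends on it.
double criticalPathLength(const std::vector<Event>& history) {
    std::vector<double> finish(history.size(), 0.0);
    double longest = 0.0;
    for (std::size_t i = 0; i < history.size(); ++i) {
        double start = 0.0;
        for (std::size_t p : history[i].preds) start = std::max(start, finish[p]);
        finish[i] = start + history[i].duration;
        longest = std::max(longest, finish[i]);
    }
    return longest;
}
```

Arcs within a process and arcs between processes are treated alike here: both appear
as entries in preds, so a delay caused by waiting on another process lengthens the
path in the same way as local computation does.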

An execution of a parallel program instrumented for performance monitoring
generates a massive amount of data. This data is incomprehensible in its raw form,
so we developed an interactive graphical display and analysis program called Moviola.
Moviola features a flexible user interface (graphics and LISP) and analytic tools
(critical path analysis).

The “streams” package, part of the NFS (Network File System) interface to the
Butterfly, was implemented. Mellor-Crummey produced an integrated instrumentation
package that extends Instant Replay with the performance monitoring package.
This uses the streams package for asynchronous transfer of “history data.”

We experimented with the use of Moviola and the instrumentation package in
debugging, performance analysis, and tuning. Mellor-Crummey’s thesis
demonstrated their effects in the development of parallel sorting programs
[Mellor-Crummey 1989].


7.1  Performance  Monitoring  and  Debugging

Parallel programming requires that programmers deal with new and unfamiliar
abstractions, often using tools designed for sequential programs. Debugging is
complicated by parallelism, and traditional cyclic techniques for debugging may not help,
since many parallel programs have non-repeatable behavior. Program profilers are
of little use in performance tuning, since it may be difficult to determine the impact
of an individual process on overall performance, the effects of process decomposition,
and the outcome of specific optimizations. Tools that report the instantaneous
level of parallelism can illustrate how well the program is performing, but provide no
guidance on how to improve parallelism.

For the past four years we have been developing a toolkit for debugging and
performance analysis of parallel programs on large-scale shared-memory multiprocessors.
Our approach is to use program replay in cyclic, post-mortem analysis. Cyclic
debugging assumes that experiments are interactive and repeatable, and that all relevant
program behavior is observable. Unlike other work, such as Behavioral Abstraction
or PIE, in which monitoring software filters relevant information during execution, we
save enough information to reproduce an execution for detailed analysis off-line. A
distinguishing characteristic of our work is the integration of debugging and
performance analysis, based on a common underlying representation of program executions.

In parallel program analysis, the focus of concern is no longer simply the internal
state of a single process, but must include internal states of (potentially) many
different processes and the interactions among processes. A cyclic methodology can still
be used, but four issues that complicate analysis must first be addressed: (1) parallel
programs often exhibit nonrepeatable behavior, (2) interactive analysis can distort a
parallel program’s execution, (3) analysis of large-scale parallel programs requires the
collection, management, and presentation of an enormous amount of data, and (4)
the execution environment, which must admit extensive parallelism, and the analysis
environment, which must provide a single, comprehensive user interface, may differ
dramatically. Our research is currently devoted to addressing each of these issues.


7.2  Monitoring  Parallel  Programs 

Monitoring parallel programs for cyclic debugging requires that essential information
be extracted during execution to allow for reproducible experiments. Unfortunately,
parallel programs may exhibit timing-dependent behavior due to race conditions in
synchronization or programmer intervention during debugging. To allow cyclic
debugging and reproducible behavior during debugging, the monitoring system must
capture both program state information and relative timing information.

Several message-based debuggers have been developed that record the contents of
each message sent in the system in an event log. The programmer can either review
the messages in the log in an attempt to isolate errors, or use the events as
input to replay the execution of a process in isolation. Experiments with executions
can be reproduced by presenting the same messages to each process in the proper
sequence.

Our approach to monitoring for shared-memory parallel programs is based on a
partial order of accesses to shared objects. In this approach, all interactions between
processes are modeled as operations on shared objects. During program execution
each process records a history of its accesses to shared objects, collecting a trace
of all synchronization events that occur. The union of the individual process histories
specifies a partial order of accesses to each shared object. This partial order,
together with the source code and input, characterizes an execution of the parallel
program. Since an execution history contains only synchronization information, it
is much smaller than a record of all data exchanged between processes, making it
relatively inexpensive to capture.
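
One way to record and later enforce such an access order is to attach a version number
to each shared object. The sketch below is a simplified, single-threaded illustration
under that assumption. It handles writes only, and the class and method names are
hypothetical; it should not be read as the implementation used in the toolkit.

```cpp
#include <cstddef>

// Simplified sketch: one shared object whose writes are totally ordered by a
// version number. The reader bookkeeping that a concurrent-reader protocol
// would need is omitted.
class SharedObject {
public:
    // Record mode: perform a write and return the version it observed.
    // The caller appends this number to its own history; no data is logged.
    std::size_t recordWrite(int value) {
        data_ = value;
        return version_++;
    }

    // Replay mode: the write may proceed only when the object has reached
    // the version this process logged during the original run.
    bool tryReplayWrite(std::size_t loggedVersion, int value) {
        if (version_ != loggedVersion) return false;  // not this process's turn yet
        data_ = value;
        ++version_;
        return true;
    }

    int value() const { return data_; }

private:
    int data_ = 0;
    std::size_t version_ = 0;
};
```

Note that the history holds only version numbers. The values written are regenerated
by re-running the program on the same input, which is why such a history stays small.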

In addition to race conditions, other nondeterministic execution properties, such
as asynchronous interrupts, can cause nonreproducible behavior. We have developed a
software instruction counter to reproduce these events during program replay
[Mellor-Crummey and LeBlanc 1989].


7.3  A  Toolkit  for  Parallel  Program  Analysis 

The information we collect during program monitoring can be used to replay a
program during the debugging cycle. During replay, events can be observed at any level
of detail and controlled experiments can be performed. More important, however, is
that we use program monitoring to create a representation for an execution that can
be analyzed by our programmable toolkit.

The core of our toolkit consists of facilities for recording execution histories, a
common user interface for the interactive, graphical manipulation of those histories,
and tools for examining and manipulating program state during replay of a previously
recorded execution. The user interface for the toolkit resides on the programmer’s
workstation and consists of two major components: an interactive, graphical browser
for analyzing execution histories and a programmable Lisp environment. The
execution history browser, called Moviola, is written in C and runs under the X Window
System.

Moviola implements a graphical view of an execution based on a DAG
representation of processes and communication. Moviola gathers process-local histories and
combines them into a single, global execution history in which each edge represents a
temporal relation between two events. In a Moviola diagram, time flows from top to
bottom. Events that occur within a process are aligned vertically, forming a time-line
for that process. Edges joining events in different processes reflect temporal
relationships resulting from synchronization. Event placement is determined by global
logical time computed from the partial order of events collected during execution.
Each event is displayed as a shaded box with height proportional to the duration of
the event (e.g. Fig. 6).

Moviola’s user interface provides a rich set of operations to control the graphical
display. Several interactive mechanisms, including independent scaling in two
dimensions, zoom, and smooth panning, allow the programmer to concentrate on interesting
portions of the graph. Individual events can be selected for analysis using the mouse;
the user has control over the amount and type of data displayed for selected events.
The user can also control which processes are displayed and how they are displayed.
By choosing to display dependencies for a subset of the shared objects, screen clutter
can be reduced.

Many different analyses are possible based on this graphical view of an execution,
but the sheer size of an execution history graph makes it impractical to base all
analyses on manual manipulation of the graph. Extensibility and programmability
are provided by running all tools under the aegis of Common Lisp. Tools can take
the form of interpreted Lisp, compiled Lisp, or, like Moviola, foreign code loaded
into the Lisp environment. Our programmable interface enables a user to write Lisp
code to traverse the execution graph built by Moviola to gather detailed,
application-specific performance statistics. The programmable interface is especially useful for
performing well-defined, repetitive tasks, such as gathering the mean and standard
deviation of the time it takes processes to execute parts of their computation, or how
much waiting a process performs during each stage of a computation.
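
The kind of repetitive statistic mentioned above is simple to state in code. The
toolkit's programmable interface is Lisp; the C++ fragment below only illustrates the
computation, and the Stats type and summarize function are names invented for this
sketch. It assumes the durations have already been extracted from the execution graph.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Stats {
    double mean;
    double stddev;
};

// Mean and population standard deviation of a set of durations, for example
// the time each process spent in one stage of a computation.
Stats summarize(const std::vector<double>& durations) {
    if (durations.empty()) return {0.0, 0.0};
    double sum = 0.0;
    for (double d : durations) sum += d;
    double mean = sum / durations.size();
    double squares = 0.0;
    for (double d : durations) squares += (d - mean) * (d - mean);
    return {mean, std::sqrt(squares / durations.size())};
}
```

A large standard deviation relative to the mean for one stage would suggest load
imbalance among the processes in that stage.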

The programmable interface can also be used to create different views of an
execution. We might want to use program animation to analyze dynamic activity over
static communication channels, or application-specific views to describe the progress
of a computation in terms of the program, rather than the low-level view provided
by Moviola. For performance analysis, the performance graphs produced by PIE or
SeeCube are much more effective than a synchronization DAG. Our current work is
using the programmable interface to extend the range of views for an execution
available to users, from application-specific views to detailed performance graphs (Fig. 7).

We have already constructed a mechanism for remote, source-level debugging for
Psyche, in the style of the Topaz TeleDebug facility developed at DEC SRC. An
interactive front end runs on a Sun workstation using the GNU gdb debugger. The
debugger communicates via UDP with a multiplexor running on the Butterfly’s host
machine. The multiplexor in turn communicates with a low-level debugging stub (lld)
that underlies the Psyche kernel.

We have successfully used this facility for kernel debugging and plan to use it as
a base for user-level, multi-model debugging. Low-level debugger functions will be
implemented by a combination of gdb and lld. High-level commands from the user
will be translated by a model-specific interface, created as part of the programming
model.

In addition, debugger stubs have been implemented to enable complex debugger
queries and conditional breakpoints during execution. The toolkit has been integrated
with an extended version of the gdb debugger, enabling source-level debugging during
replay of multiprocess programs. The Moviola graphical interface has been improved,
significantly reducing the display time and increasing the functionality. The S
graphics package has been added to the toolkit, facilitating graphical displays of
performance data. LISP tools have been written for critical path analysis and for gathering
and plotting performance statistics. All displays in the toolkit can be reproduced as
hardcopy using PostScript format.

We are beginning to explore the relationship between program analysis,
programming model (process and communication semantics), and visualization. We are
investigating techniques that can be used across several parallel programming models,
and a tool interface that allows a programmer to debug using the primitives provided
by a particular programming model. Our goal is to (a) provide a framework that
unifies our approach, as embodied in our toolkit, with the various techniques for
program monitoring and visualization that have been described in the literature and (b)
develop a methodology, and corresponding tools, for parallel program analysis that
can be used step-by-step by programmers for the entire software development cycle,
from initial debugging to performance modeling and extrapolation.


Figure 7: A perspective plot of communication time per process per row for Gaussian
Elimination, as produced by the toolkit. The x-axis corresponds to the 36 processes
in the computation, the y-axis corresponds to rounds of communication, one per pivot
row, and the z-axis is communication time for a round. The plot shows the increase
in communication time (caused by contention) as the computation progresses.


8  Programming  Environments  for  Pipelined  Parallel  Vision:  Zebra  and  Zed

Under the DARPA contract, Rochester developed an object-oriented programming
interface to Datacube’s MaxVideo family of image processing boards. The system is
called Zebra. Zebra is not simply a packaged version of the standard Maxware calls,
but rather a different style of programming for the Datacube hardware.

The basic philosophy of Zebra is two-fold. First, each board type is represented
by an object class. Each physical MaxVideo board is represented by an instance of
its class. Simply by declaring the board objects as variables, the boards are opened
and initialized. Second, Zebra takes a microprogramming-like approach to controlling
Datacube boards. The register set for each board is considered to be a
micro-instruction word. This instruction word completely specifies a board configuration.
By sending instruction words to boards, the hardware can be completely programmed
in a microprogramming-like manner.

As a result, applications code differs considerably from its Maxware
counterpart. The configuration of MaxVideo boards is not represented in the call
sequence of the application program but rather in a text file, which may be changed
without recompiling the application program. Thus the development process is
streamlined by requiring fewer compilations.

Instruction words can be stored in and retrieved from files, allowing the sharing
of standard configurations between developers. Instruction words are created and
modified via an instruction word editor. One such editor, “Zed”, is provided with
Zebra.
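
The idea of keeping a board configuration in an editable text file can be sketched as
follows. Everything concrete here is an assumption made for illustration: the report
does not give the Zebra file format, so the one-pair-per-line layout, the register
names, and the functions toText and fromText are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Hypothetical model of an instruction word: register name -> register value.
// The real Zebra configuration file format is not described in this report.
using InstructionWord = std::map<std::string, std::uint32_t>;

// Write one "name value" pair per line, so the result can be edited by hand.
std::string toText(const InstructionWord& word) {
    std::ostringstream out;
    for (const auto& reg : word) out << reg.first << ' ' << reg.second << '\n';
    return out.str();
}

// Parse the same format back into an instruction word.
InstructionWord fromText(const std::string& text) {
    InstructionWord word;
    std::istringstream in(text);
    std::string name;
    std::uint32_t value;
    while (in >> name >> value) word[name] = value;
    return word;
}
```

With a representation along these lines, changing a board's behavior means editing
and reloading the text, not changing and recompiling the calling program.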

Zed  allows  a  programmer  to  create  a  new  instruction  word  or  modify  an  existing 
one  directly  from  the  keyboard.  This  instruction  word  may  then  be  saved  in  a  file 
or  loaded  directly  into  a  physical  board  for  testing.  This  allows  rapid  prototyping  of 
board  configurations. 

Some  details  of  Zebra  are  the  following: 

•  It is object oriented and written in C++: it encapsulates each board as an
object, created and initialized upon declaration, that can be updated and queried.

•  It  leads  to  far  less  complicated  applications  code  than  Maxware. 

•  It uses explicit board descriptions that both humans and programs can read and
write, which are a succinct and stable way to store, access, re-use, and share board
configurations.

•  It  is  not  based  on  any  other  interface  software  (it  does  not  use  Maxware  or  the 
Datacube  device  driver,  for  instance). 

•  It  already  runs  on  two  dissimilar  architectures  at  UR  (the  BBN-ACI  Butterfly 
Parallel  Processor  and  Suns).  It  only  assumes  a  memory-map  operating  system 
call  and  so  is  highly  portable  between  host  architectures. 

Rochester has also developed Zed, which is released with Zebra. Zed has the
following characteristics:

•  It is an illustrative Zebra application.

•  It  provides  an  interactive,  menu-based  interface  for  board  configuration,  editing, 
and  experimentation. 

•  It  runs  on  any  standard  terminal,  and  under  Suntools  and  X-windows. 

•  It  allows  new  users  to  begin  using  Datacube  hardware  in  minutes. 

The  following  example  Zebra  program  uses  the  P3  bus  to  implement  a  full-frame 
continuous  transfer  of  image  data  from  Digimax  to  a  ROI-Store  512,  back  to  Digimax, 
and  up  onto  a  monitor. 

    int main()
    {
        // create and init the boards
        dgBoard digimax(DG_00_BASE, DG_00_IVEC, "ZdgInit.zff");       // line 1
        rsBoard rs0(RS_00_RBASE, RS_00_MBASE, RS_M512, RS_00_IVEC,    // line 2
                    "ZrsCont512.zff");

        // fire the transfer
        rs0.fire(RS_READ);                                            // line 3
        rs0.fire(RS_WRITE);                                           // line 4
    }

Line one declares an object of class dgBoard with the name digimax. This opens
the board specified at VME address “DG_00_BASE”, and initializes the board with
the configuration in file “ZdgInit.zff”. Line two similarly declares a ROI-Store board
object. Lines three and four are analogous to Maxware rsRFire and rsWFire
respectively. Note that to change this program to do a single-shot “snapshot” transfer, the
configuration file can be changed without recompiling the program. Alternatively, a
different configuration file can be used. Zebra and Zed are available free of charge by
anonymous FTP from cs.rochester.edu.


9  Parallel  Vision  Applications 

Although the focus of the contract was on developing a programming environment,
Rochester also developed parallel vision applications as a test and a driving force for the
systems development. This section briefly outlines some of the more influential of the
projects; more details are available in the literature [e.g. Brown et al. 1985; 1988].


9.1  SIMD-style  Low-level  Vision  on  the  Butterfly 

Rochester participated in the first DARPA benchmark study. One aspect of that
work motivated much of our current research in multi-model parallel programming
environments and performance modeling tools. The other aspect was a successful
demonstration that SIMD-style (data-parallel) low-level vision applications could be
performed on an MIMD computer. Fig. 8 shows some results for border-following.
Extensive analysis and demonstration programs for multi-resolution image pyramid
generation, line-finding, connected component analysis, and the Hough transform
were also developed [Brown 1986; Olson 1986b, c; Olson et al. 1987].


9.2  Parallel  Object  Recognition 

Paul Cooper and Michael Swain cooperated to investigate object recognition, based
on object relational structure and some geometry, from a large database. This work
was based on a connectionist, massively parallel framework, and led to hardware (VLSI
circuit) designs and implementations on the Connection Machine at NPAC in Syracuse,
NY [Cooper 1988, 1989; Cooper and Swain 1988, 1989; Swain and Cooper 1988; Swain
1988].

9.3  Cooperating  Intrinsic  Image  Calculations 

John Aloimonos took a mathematical approach in his thesis to unifying several
disparate results on extracting physical attributes from images [Aloimonos et al. 1985;
Aloimonos 1986; Aloimonos and Brown 1984a, b, 1988, 1989; Aloimonos and Swain
1985; Brown et al. 1987, etc.]. The state of knowledge when he started is shown in
Fig. 9.

As a result of his work, mathematical constraints were developed to allow these
calculations to be combined to produce more robust results with less restrictive
assumptions. This work is reported in his recent book Integration of Visual Modules,
written with Dave Schulman, and summarized in Fig. 10.



Figure  9:  Previous  work  on  intrinsic  image  calculation. 



The  characteristics  of  well-known  visual  problems  are  radically  changed  by  this 
approach  (Fig.  11),  which  yields  robust,  linear  solutions  with  fewer  assumptions. 

9.4  Markov  Random  Fields  and  Massively  Parallel  IU 

In their thesis work, Dave Sher and Paul Chou pursued a probabilistic approach to
image understanding, which could be implemented as a Markov Random Field [Sher
1987a, b, c; Chou 1988; Chou and Brown 1987a, b; Chou et al. 1987; etc.]. Image
understanding then takes the form of labelling individual pixels or features in the
image with properties such as “boundary”, “no boundary”, or a depth value. This
approach allows for a uniform and real-time evidence combination algorithm for
multi-sensor fusion and a parallelizable algorithm for the labelling. Using this approach,
the reconstructionist visual approach that tries to create depth maps from images is
integrated with the solution to the segmentation problem, which identifies boundaries
and objects within the scene. Chou developed the Highest Confidence First (HCF)
algorithm for labelling. He made a quantitative comparison among several known
Markov Random Field algorithms, and HCF was shown to be superior to all those
known at the time. HCF is inherently sequential. Later work at Rochester by Swain
and Wixson parallelized the algorithm for the Butterfly, with improved qualitative,
quantitative, and (of course) timing results [Swain and Wixson 1989; Swain et al.
1989].

Fig.  12  shows  the  performance  of  HCF  on  a  boundary-detection  task.  Fig.  13 
shows  the  results  of  combining  sparse  depth  measurements  with  intensity  data  to 
produce  a  depth  map  of  the  scene  and  a  boundary  map  simultaneously. 
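
The flavor of a confidence-ordered labelling can be illustrated with a small sketch.
This is not Chou's formulation, which is given in the cited reports; it is a simplified
greedy scheme in the same spirit, under assumptions that are ours: a one-dimensional
chain of sites, a per-site data cost table `unary`, and a fixed penalty `beta` whenever
two committed neighbors disagree. Sites start uncommitted, and the site whose best
label most clearly beats its alternatives is committed (or revised) first.

```python
def hcf_label(unary, beta=0.5):
    """Greedy confidence-ordered labelling on a 1-D chain (illustrative only).

    unary[i][l] is the assumed data cost of giving label l to site i;
    beta is an assumed penalty for disagreeing with a committed neighbor.
    """
    n = len(unary)
    n_labels = len(unary[0])
    labels = [None] * n  # None marks an uncommitted site

    def energies(i):
        # Local energy of each candidate label, counting only committed neighbors.
        out = []
        for l in range(n_labels):
            e = unary[i][l]
            for j in (i - 1, i + 1):
                if 0 <= j < n and labels[j] is not None and labels[j] != l:
                    e += beta
            out.append(e)
        return out

    def stability(i):
        # Negative stability means the site would gain by committing or switching.
        e = energies(i)
        if labels[i] is None:
            best, second = sorted(e)[:2]
            return best - second
        others = [e[l] for l in range(n_labels) if l != labels[i]]
        return min(others) - e[labels[i]]

    while True:
        i = min(range(n), key=stability)  # least stable site goes first
        if stability(i) >= 0:
            break
        e = energies(i)
        labels[i] = e.index(min(e))

    # Sites left uncommitted by exact ties simply take their best label.
    for i in range(n):
        if labels[i] is None:
            e = energies(i)
            labels[i] = e.index(min(e))
    return labels


observations = [0.0, 0.1, 0.6, 0.1, 0.0, 1.0, 0.9, 0.4, 1.0]
costs = [[(o - l) ** 2 for l in (0, 1)] for o in observations]
print(hcf_label(costs, beta=0.5))
```

The sketch rescans every site on each pass for clarity; a practical version would keep
the sites in a priority queue and update only the neighbors of the site just changed.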


9.5  Pipelined  Parallelism  and  Real-time  Object  Search 

A good example of the cooperation of real-time vision processing and a mobile
observer is provided by Rochester’s program of work on fast object detection, which
Figure 10: Some of Aloimonos's contributions in cooperating intrinsic image
calculation.
Problem: Shape from shading
  Passive Observer: Ill-posed problem. Needs to be regularized. Even then, unique
    solution is not guaranteed because of nonlinearity.
  Active Observer: Well-posed problem. Unique solution. Linear equation used.
    Stability.

Problem: Shape from contour
  Passive Observer: Ill-posed problem. Has not been regularized up to now in the
    Tichonov sense. Solvable under restrictive assumptions.
  Active Observer: Well-posed problem. Unique solution for both monocular or
    binocular observer.

Problem: Shape from texture
  Passive Observer: Ill-posed problem. Needs some assumption about the texture.
  Active Observer: Well-posed problem. No assumption required.

Problem: Structure from motion
  Passive Observer: Well-posed but unstable. Nonlinear constraints.
  Active Observer: Well-posed and stable. Quadratic constraints, simple solution
    methods, stability.

Problem: Optic flow (area based)
  Passive Observer: Ill-posed. Needs to be regularized. The introduced smoothness
    might produce erroneous results.
  Active Observer: Well-posed problem. Unique solution. Might be unstable.

Figure 11: Combining constraints gives better solutions for vision problems.
uses relational modeling and reasoning about occlusion.

The  ability  to  find  a  certain  object  in  an  unknown  environment  is  a  component  of 
many  real-world  problems  that  a  general-purpose  robot  might  face.  Lambert  Wixson 
studied  this  visual  task,  object  search.  His  research  is  divided  into  three  areas,  all  of 
which  attack  the  key  problem  of  robustly  finding  the  object  in  the  smallest  possible 
time. 

The  first  is  the  problem  of  object  recognition.  Most  research  on  model-based 
object  recognition  from  a  single  camera  has  concentrated  on  robustness.  While  this 
is  obviously  an  important  first  step,  the  object  search  task  brutally  illustrates  that 
speed  is  just  as  important.  Almost  all  current  object  recognition  schemes  require 
that  image  features  be  matched  to  model  features,  requiring  a  time  polynomial  in 
the  number  of  features  to  perform  the  matching.  This  polynomial  time  is  a  result  of 
having  to  match  the  image  features  to  the  model  features  in  order  to  calculate  and 
refine  the  pose  estimate  of  the  object  in  the  scene.  By  adding  an  initial  stage  that 
does  not  perform  pose  calculation  but  rather  simply  detects  the  likely  presence  of 
the  object  in  the  image,  considerable  efficiency  can  be  gained.  The  idea  is  that  this 
initial  stage  would  be  used  to  rank  each  gaze  in  a  set  of  candidate  gazes  according  to 
the  likelihood  that  the  image  produced  by  the  gaze  contains  the  desired  object.  This 
ranking  can  then  be  used  to  choose  the  order  in  which  a  more  sophisticated  object 
recognition  program  (which  would  calculate  pose)  should  be  applied  to  the  candidate 
images. 

Wixson  [Wixson  and  Ballard  1989]  constructed  an  object  detection  scheme  that 
relies  on  the  assumption  that  the  color  histogram  of  an  object  can  be  used  as  an 
object  “signature”  which  is  invariant  over  a  wide  range  of  scenes  and  object  poses. 
The color histogram is computed at 3 Hz by the Datacube hardware, and the matching
compares 18 database items to a histogram in one second. Counting time to move the
robot  to  a  new  gaze  position  (one  and  one-half  seconds  per  move),  each  gaze  can  be 
evaluated  for  its  object  content  in  just  under  3  seconds.  Fig.  14  shows  some  sample 
results. 
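
A minimal sketch of ranking candidate gazes by color-histogram similarity follows. The
report does not state which matching measure was used; histogram intersection is
assumed here for illustration, and the function names, the bin count, and the
representation of an image as a list of (r, g, b) tuples are likewise our assumptions.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized color histogram of (r, g, b) pixels with channels in 0..255."""
    hist = [0.0] * bins_per_channel ** 3
    for r, g, b in pixels:
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def rank_gazes(gaze_histograms, model_histogram):
    """Order gaze names by decreasing similarity to the object's histogram."""
    scores = {gaze: intersection(h, model_histogram)
              for gaze, h in gaze_histograms.items()}
    return sorted(scores, key=scores.get, reverse=True)


red_box = [(250, 10, 10)] * 10
blue_wall = [(10, 10, 250)] * 10
model = color_histogram(red_box)
gazes = {"left": color_histogram(blue_wall),
         "right": color_histogram(red_box + blue_wall)}
print(rank_gazes(gazes, model))
```

The ranking would then decide the order in which the slower pose-computing recognizer
visits the gazes. The timing quoted above is consistent with one histogram per gaze:
about 0.33 s for the histogram, 1 s for matching, and 1.5 s for the move give roughly
2.8 s.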

The second area of object search is the use of high-level knowledge of common
relationships and interactions between objects (i.e., the contexts in which certain
objects typically appear) to direct the search process [Wixson to appear]. For example,
if the robot is looking for a pen, it might be wise to search for a desk first. We refer to
this use of high-level knowledge as indirect search. Our approach formulates indirect
search using a finite set of relationships (FRONT-OF, NEAR, LEFT-OF, etc.)
between objects. The relationships may be known a priori or, more interestingly, derived
from experience with the scene. Initially objects will be represented as a (perhaps
Figure 13: Experiments with Stereo Disparity Data (II)


a)  200  by  200  intensity  image,  b)  Locations  of  the  disparity  measurements  overlaid 
with  the  TLR  estimate  of  the  intensity  discontinuities,  c)  Input  disparity  image,  d) 
Reconstructed  disparity  map.  e)  Disparity  discontinuities. 
partial) local coordinate system (a circularly symmetric object might only have a Z
axis  and  origin,  for  example)  and  a  feature  vector.  Characterizing  the  occurrence  of 
relationships  as  Bernoulli  trials  leads  to  a  confidence  interval  representation  of  the 
probability  of  the  relations  holding.  In  turn,  these  probabilities  can  be  used  in  a 
“highest  impact  first”  search  that  acquires  information  in  the  order  that  maximally 
decreases  expected  uncertainty.  The  result  is  to  derive  Garvey-like  strategies  on  the 
fly,  with  learning,  and  from  first  principles. 
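
As an illustration of the confidence-interval representation, the sketch below treats
each observation of a relation (for example, a pen found NEAR a desk) as a Bernoulli
trial and computes an interval for the probability that the relation holds. The report
does not say which interval was used; the Wilson score interval and the function name
are assumptions made for this example.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a Bernoulli probability (z=1.96 gives about 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no evidence yet: the relation is completely uncertain
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# The same observed frequency gives a tighter interval with more experience.
print(wilson_interval(8, 10))
print(wilson_interval(80, 100))
```

The interval narrows as experience with the scene accumulates, so its width is one
plausible measure of the uncertainty that a search step is expected to reduce.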

The  third  area  of  object  search  involves  reasoning  about  obstacles  and  occlusion 
to  the  extent  that  they  affect  the  task  of  finding  the  desired  object.  This  research 
is  in  progress.  We  would  like  a  system  which  can  reason,  for  example,  that  since  it 
hasn’t  yet  seen  the  object,  but  the  area  under  the  desk  has  not  been  examined,  then 
this  area  should  be  examined.  Many  issues  are  present  in  this  problem.  The  largest  is 
the choice of a world representation which can support this reasoning without being
computationally  problematic.  The  reasoning  and  world  modeling  must  also  be  robust 
to  sensor  noise  and  marginal  errors  in  the  depth  estimation  process  used  to  detect 
occlusions  in  the  scene. 

Wixson’s work assumed a solution to the object recognition problem. Mike Swain
investigated color cues for object recognition [Swain 1988a, b]. Fig. 15 shows 19
pairs of images (the originals are colored): on the left of each pair is a catalog entry,
on the right an instance from a real scene. Fig. 16 shows confusion matrices for the 19
image instances recognized from their catalog descriptions. The instance views have
different viewing angles from those that generated the catalog. The basic description
is a color histogram; a saliency measure subtracts histogram features common
to the ensemble, thus weighting more heavily the features that are unique to each
object.
Figure 14: (a) Top view of the laboratory environment for a typical test run showing
the  direction  (but  not  the  distance)  of  each  object  with  respect  to  the  robot. 


Figure  14b 

Gaze  directions  produced  by  the  object  search  mechanism  for  the  “Clorox”  and  “All” 
detergent  boxes.  Area  of  circle  is  proportional  to  the  confidence  of  detection  in  that 
gaze.  Numbers  next  to  circles  reflect  the  ordering  of  the  confidences  in  decreasing 
order.  The  dashed  lines  in  each  circle  are  merely  to  provide  reference  points. 


Figure  15:  Black  and  white  reproduction  of  color  originals  of  (catalog,  instance)  image 
pairs. 


9.6  Gaze  Control 

In research carried out at Oxford, Chris Brown did work on Kalman filters for
tracking applications (reported in the DARPA IU Proceedings), on projectively invariant
matching of geometric structures in images (reported in the European Vision
Conference), and on control of Rochester’s robot head [Brown 1989b, c; 1990a, b].

The work investigated predictive mechanisms to solve problems of cooperation
and delay. “Subsumption” architectures like those of Brooks and Connell find these
problems troublesome since internal state representations are minimized, control
interaction is usually limited to preemption, and actions are synchronized only through
the outside world.

The work developed eight camera controls and investigated their interaction. It
showed that predictive techniques can overcome the catastrophic effects of delays and
interactions. It made comparisons with primate gaze controls and with an open-loop
approach to delay. Tracking, gaze shifts, and vergence controls used three-dimensional,
not retinal, coordinates. Optimal estimation techniques were used to estimate
and predict the dynamic properties of the target.
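
The idea of predicting through a delay can be sketched with a fixed-gain
(alpha-beta) tracker, a simplified relative of the Kalman filter. The gains, the
one-dimensional constant-velocity target model, and the function name are assumptions
chosen for illustration, not the estimator used in the work described here.

```python
def predict_through_delay(measurements, dt, delay, alpha=0.5, beta=0.1):
    """Track position and velocity, then extrapolate each estimate by `delay`.

    measurements: target positions sampled every dt seconds (assumed 1-D).
    Returns one delay-compensated position per measurement after the first.
    """
    x, v = measurements[0], 0.0
    predictions = []
    for z in measurements[1:]:
        x_pred = x + v * dt            # where the model expects the target
        residual = z - x_pred          # surprise in the new measurement
        x = x_pred + alpha * residual  # corrected position estimate
        v = v + (beta / dt) * residual # corrected velocity estimate
        predictions.append(x + v * delay)
    return predictions


# A target moving at 2 units per tick, observed with a 3-tick processing delay.
track = [2.0 * k for k in range(50)]
print(predict_through_delay(track, dt=1.0, delay=3.0)[-1])
```

A controller that aimed at the latest measurement would point where the target was
`delay` seconds ago; aiming at the extrapolated position removes that lag for targets
that roughly follow the assumed motion model.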

The control algorithms are run in a simulation that is meant to be general and
flexible, but especially to capture the relevant aspects of the Rochester Robot. Previous
work with the Rochester Robot had already produced several implementations of
potential basic components of a real-time gaze-control system. These components
included basic capabilities of target tracking, rapid gaze shifts, gaze stabilization against
head motion, verging the cameras, binocular stereo, optic flow and kinetic depth
calculations. These separate capabilities do not yet cooperate to accomplish tasks. The
work at Oxford was partly motivated by the need to integrate several capabilities
smoothly for a range of tasks useful for perception, navigation, manipulation, and in
general “survival”.

There  are  four  main  coordinate  systems  of  interest  in  this  work:  LAB,  HEAD, 
and  (left  and  right)  camera  and  retinal  (Fig.  17).  The  LAB,  HEAD,  and  camera 
systems  are  three-dimensional,  right-handed  and  orthogonal.  The  retinal  system  is 
two-dimensional  and  orthogonal.  LAB  is  rigidly  attached  to  the  environment  in  which 
the  animate  system  and  objects  move.  HEAD  is  rigidly  attached  to  the  head,  and 
(for  this  work)  has  three  rotational  and  three  translational  degrees  of  freedom.  The 
camera  systems  are  rigidly  attached  to  the  cameras  and  have  independent  pan  and 
Figure 16: Color recognition confusion matrices for pairs in previous figure (considered
left to right within top to bottom.) (a) Without saliency weighting on features. In
this case, the ranks of the correct choice are as follows (they should be identically 1):
2 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1. (b) With saliency weighting, the correct choices
uniformly rank first.


Table 4: Eye and head control summary. The ALT. INPUT column shows alternate
forms of input. (x, y) are image coordinates, (X, Y, Z) are world coordinates, (Rx, Ry)
are head rotation angles. A design issue is whether fast gaze shifts and tracking are
performed only by the “dominant eye” camera or by both cameras. Likewise vergence
can affect both cameras or the non-dominant camera.

a  shared  tilt  degree  of  freedom.  The  retinal  systems  represent  image  coordinates 
resulting from perspective projection of the visible world. The cameras are supported
on  a  kinematic  chain  so  that  their  principal  points  do  not  in  general  lie  on  any  head 
rotation,  pan,  or  tilt  axis. 

The  simulated  system  controls  are  summarized  in  Table  4.  Our  purpose  was  to 
investigate,  with  some  flexibility,  the  interactions  of  various  forms  of  basic  camera  and 
head controls. The controls are not meant to model those of any biological system.
Rather  the  goal  was  to  build  a  system  with  sufficient  functionality  to  exhibit  many 
control  interactions.  The  interaction  of  a  subset  of  these  controls  on  target  tracking 
and  acquisition  tasks  (the  “smooth  pursuit”  and  “saccadic”  systems)  was  investigated 
and  was  used  to  illustrate  the  effects  of  different  control  algorithms  for  coping  with 
delays. 

Fig. 18 shows five of the control systems. These controls can act together
(Fig.  19)  to  achieve  different  complex  visual  tasks  such  as  quick  target  acquisition 
and  then  tracking  (Fig.  20).  Extending  the  control  system  to  deal  with  delays 
requires  kinematic  simulation  of  the  head  and  dynamic  simulation  of  the  outside 
world  (Fig.  21). 


9.7  Parallel  Cooperating  Agents  and  Juggler 

Our  first  robotics  application,  a  balloon  bouncing  program  called  Juggler,  successfully 
ran  in  November  1989  [Yamauchi  1989].  This  application  combines  binocular  camera 
input,  a  pipelined  image  processor,  and  a  6-degree-of-freedom  robot  arm  (with  a 
squash  racquet  attached)  to  bounce  a  balloon.  The  implementation  uses  a  competing 
agent  model  of  motor  control;  five  processes  compete  with  each  other  for  access  to  the 
robot  arm  to  position  the  balloon  in  the  visual  field,  to  position  the  racquet  under 
CONTROL            INPUT                    ALT. INPUT       OUT

EYE
Gaze Shift         Target (x,y)             [illegible]      L. Pan, Tilt vel.
Track              Target (x,y)             Target (x,y)     L. Pan, Tilt vel.
Gaze Stabilize     Head Origin Rx, Ry, X,                    L. Pan, Tilt vel.
Vergence           Horiz. Disparity                          R. Pan vel.
Virtual Position   Target (X,Y,Z)                            L. Pan, Tilt vel.

HEAD
Compensate         Eye Pans, Tilt                            [illegible]
Fast Head Rotate   Target (X,Y,Z)                            [illegible]
Virtual Position   Target (X,Y,Z)                            [illegible]


Table  4 


Figure 18: Five representative head and camera controls: (a) rapid gaze shift,
(b) tracking, (c) vergence, (d) gaze stabilization, (e) head compensation.
Figure 20: Increasingly effective delay-free control results from superposition of
noninteracting controllers. Left and right pan and tilt angular errors in gaze direction
(in radians) are plotted against time. The hollow square shows left camera pan error,
the butterfly right camera pan error, and the dark square and hourglass show
left and right tilt errors, in (a) tracking reflex only (one dominant eye, mechanical
stops are hit); (b) adding vergence and head compensation destabilizes the system;
(c) adding vestibulo-ocular (gaze stabilization) reflex stabilizes the system and tracking
proceeds faultlessly.

the  balloon,  and  to  hit  the  balloon. 

Each  application  process  is  allocated  a  physical  processor,  so  scheduling  is  not  a 
concern.  Juggler  is  robust  because  even  if  processes  had  to  share  processors,  failure 
to  execute  any  one  process  during  a  particular  time  interval  would  have  little  if  any 
effect on behavior; in the competing agent model, each application process continually
broadcasts  commands  to  the  robot  in  competition  with  other  processes. 

Juggler  was  a  first  attempt  to  integrate  our  operating  systems  efforts  with  the 
development  of  applications.  As  a  result  of  our  experiences  with  Juggler,  we  are 
making  appropriate  extensions  to  Psyche  and  communications  capabilities,  and  we 
have  begun  to  experiment  with  user-level  scheduling. 


9.8  The  Workbench  for  Active  Vision  Experimentation 

The Workbench for Active Vision Experimentation (WAVE) has been an ongoing
effort since the summer of 1988. Its purpose is to provide a uniform and
general-purpose platform for experimental verification of our research [Brown 1988a, b;
Rimey 1990].

WAVE  essentially  was  the  first  effort  to  “integrate  everything  in  the  Lab”.  The 
original  goals  were  to  build  a  system  which  causes  the  Puma  robot  to  visually  explore 
its  environment  for  racquet  balls  randomly  hanging  from  the  Lab  ceiling  and  also 
to  produce  an  accompanying  repertory  of  simple  modular  behaviors  and  capabilities. 
In  this  system  the  Robot  first  moves  to  scan  the  entire  Lab  and  locates  each  ball 
using  binary  image  analysis  and  stereo  vision.  Next  the  Robot  moves  around  a  ball 
Figure 19: The interaction of the independent controls

Figure  21:  The  extended  control  algorithm  for  delayed  system 
while keeping it centered in the field of view. A simple animate vision technique is
demonstrated by computing a continuously time-averaged image. The accompanying
Robot movement causes the background areas in the image to be blurred while the
object remains clear, thus demonstrating a simple segmentation technique. Finally
the Robot pokes each ball with a stick. The now-moving ball is visually tracked using
the eye motors on the Head (another simple animate vision idea). The overall system
is more fully described in reports by Brown. A further result of this effort was a guide
for other members of our group on “how to use the Rochester robot”.
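
The time-averaging idea can be sketched as an exponentially weighted running average
of frames. The update rule, the `weight` parameter, and the use of nested lists for a
grayscale image are assumptions for illustration; the report does not specify how the
average was computed on the image processing hardware.

```python
def update_running_average(average, frame, weight=0.1):
    """Blend a new grayscale frame into the running average, pixel by pixel."""
    return [[(1.0 - weight) * a + weight * f for a, f in zip(avg_row, frame_row)]
            for avg_row, frame_row in zip(average, frame)]


# One-row image: pixel 0 views the fixated object (constant brightness),
# pixel 1 views background that alternates as the camera circles the object.
average = [[1.0, 0.0]]
for k in range(1, 60):
    frame = [[1.0, float(k % 2)]]
    average = update_running_average(average, frame)
print(average)
```

The fixated pixel keeps its value while the background pixel settles near a washed-out
mean, which is the blurring effect that separates the object from its surroundings.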

Last summer the WAVE platform was put to further use in a study of the problem
of moving the Head to view the front of, or a characteristic view of, an object. The
idea which we developed was to model vision with a parameter net, to model
Head movements with a basic PID controller, and then to study differential
relationships between response patterns in the parameter net and the command signals
sent to the PID controller. The parameter net represented an object using a Hough
transform of its silhouette. Nearness to a characteristic viewpoint was related to a
distortion measure over nodes in the parameter net. The system was implemented,
but performed poorly. A similar effort based on a color image approach (Wixson’s
work, described above) performed slightly better.
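
For readers unfamiliar with the term, a basic PID controller can be sketched as below.
The class, the gains, and the toy plant in `simulate_pan` (a head axis whose angle
simply integrates the commanded rate) are assumptions for illustration and do not
describe the controller actually used on the Head.

```python
class PID:
    """Textbook proportional-integral-derivative controller."""

    def __init__(self, kp, ki=0.0, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the command for the current error, sampled every dt seconds."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_pan(pid, target, steps, dt):
    """Drive a hypothetical pan axis (angle integrates rate) toward `target`."""
    angle = 0.0
    for _ in range(steps):
        rate = pid.step(target - angle, dt)
        angle += rate * dt
    return angle


print(simulate_pan(PID(kp=2.0), target=1.0, steps=200, dt=0.05))
```

In the study described above, the error fed to such a controller would come from the
parameter net's distortion measure rather than from a known target angle.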

Over time WAVE has evolved into a more general platform. In anticipation of
moving over to the Psyche operating system running on the Butterfly parallel computer,
WAVE was converted to the C++ programming language used by Psyche and
to use the Zebra system for programming our DataCube MaxVideo
image processing hardware. Zebra currently works with the Psyche/Butterfly system
as well as the original Sun machines. WAVE itself has not yet been adapted to run
under Psyche.

9.9  Modeling attentional behavior sequences with an augmented hidden Markov model

Selective  attention,  or  the  intelligent  application  of  limited  visual  resources,  has 
emerged  as  a  basic  topic  for  a  long-range  program  of  research  we  are  now  pursuing. 
The  concept  here  is  that  realistically  any  system  has  to  deal  with  limited  sensing  and 
computational  resources,  and  that  therefore  we  should  focus  our  study  on  (selective 
attention)  mechanisms  to  deal  with  such  limited  resource  situations. 

One approach is to map the visual attention problem onto sensor allocation
problems such as where to point a camera and where to allocate processing within a single
image from that camera. If we assume a spatially-variant sensor (such as one with a
small, high-resolution fovea and a large, low-resolution periphery) one specific
problem is to decide what sequence of eye movements to make to selectively position the
fovea in the scene. One aspect of the work attacks the specific problem of modeling
foveation sequences. In most treatments of this subject, a sequence of eye movements
emerges as a result of sequential cognitive effort and image analysis, and is not
explicitly represented. We decided to augment the usual paradigm with a new explicit
representation of probabilistic but task-dependent attentional sequencing. Explicit
sequences are something like motor skills; they efficiently capture the effect of much
cognitive activity and feedback-mediated behavior, and allow it to be generated quickly
with low cognitive overhead.

The explicit representation is an augmented hidden Markov model (AHMM). A
simple hidden Markov model can learn an emergent behavior and re-generate it as
an explicit data-oblivious sequence. An AHMM incorporates a feedback sequence
to modify the generated sequence. It can therefore relearn or constantly modify its
own (feedback-modified) explicit behavior, thus adapting to varying conditions. Two
AHMM models have been developed: the first uses a simple external feedback
loop; the second uses internal feedback which modifies the internal parameters
(probabilities) of the AHMM, thus affecting the generation likelihoods directly. This
work has been experimentally verified using the capabilities of WAVE and the results
are encouraging [Rimey and Brown 1990].
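
The distinction between data-oblivious generation and internal feedback can be
illustrated with a small sketch. It is not the AHMM of Rimey and Brown: the class,
the `reinforce` rule, and the example states and symbols are hypothetical, chosen only
to show a hidden Markov model emitting a fixation sequence and having its transition
probabilities nudged by feedback.

```python
import random


class SequenceHMM:
    """Hidden Markov model used as a generator of fixation commands."""

    def __init__(self, transitions, emissions, start, seed=0):
        self.transitions = transitions  # state -> {next_state: probability}
        self.emissions = emissions      # state -> {symbol: probability}
        self.state = start
        self.rng = random.Random(seed)

    def _sample(self, dist):
        keys = list(dist)
        return self.rng.choices(keys, weights=[dist[k] for k in keys])[0]

    def step(self):
        """Emit one symbol from the current state, then move to the next state."""
        symbol = self._sample(self.emissions[self.state])
        self.state = self._sample(self.transitions[self.state])
        return symbol

    def reinforce(self, state, next_state, gain=0.5):
        """Hypothetical internal feedback: make one transition more likely."""
        row = self.transitions[state]
        row[next_state] += gain
        total = sum(row.values())
        for key in row:
            row[key] /= total


model = SequenceHMM(
    transitions={"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.7, "B": 0.3}},
    emissions={"A": {"look-left": 0.9, "look-up": 0.1},
               "B": {"look-right": 0.9, "look-up": 0.1}},
    start="A")
print([model.step() for _ in range(6)])
model.reinforce("A", "A")  # feedback favors staying in state A
```

Without calls to `reinforce` the generator ignores the image data entirely; with them,
the sequence statistics drift toward whatever the feedback rewards.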


10  Planning  in  a  Parallel  System 

We have been exploring ways of forming and executing strategies that involve
sequences of primitive behaviors. Actions and perception are the only realistic way
to bring computerized decision-making and planning into contact with reality. This
“planning” capability is necessary for systems that are to be more than reflexive [Feist
1989a, b], and which must solve problems and make decisions about what to do next
[Allen and Pelavin 1986; Allen et al. 1990]. Making such decisions with uncertain
information under time constraints is beyond the current state of the art, although
decision-making under uncertainty, reasoning about actions through time [Allen 1989;
Allen and Hayes 1987], and in general the questions of what to believe and what to
do next pervade all of intelligent behavior. At Rochester, these questions are being
investigated in the context of ARMTRAK [Martin et al. 1990], a micro-world
under development, based on the control of model trains, designed to integrate work in
natural language, planning, vision, and robotics.

Two  versions  of  ARMTRAK  have  been  implemented:  a  simulation  and  a  set  of 
trains  coupled  to  the  sensors  associated  with  the  Rochester  Robot.  The  simulation 
allows  rapid  prototyping  of  planners  and  experimentation  with  problems  posed  by 
different  layouts.  Simulations  invariably  involve  simplifying  assumptions,  however,  so 
the  real  trains  and  sensors  in  the  vision  lab  allow  us  the  rare  opportunity  of  running 
a  symbolic  planner  in  the  real  world.  The  train  controller  has  been  wired  so  that  the 
switchyard  can  be  operated  from  outside  the  robot  room.  The  vision  routines  are 
able  to  recognize  the  existence  of  a  moving  train  in  its  field  of  view  and  are  able  to 
determine  the  state  of  a  switch  in  its  field  of  view.  The  robot  also  knows  the  locations 
of the switches, so it can position itself to observe them. Despite its potential, the
ARMTRAK implementation is currently a demonstration of concept only. It does
not have a smooth interface between the LISP world, where all the work on planning
takes place, and the C environment, where the vision work is implemented. Our goal
is to support LISP on our multiprocessor, and to have shared data structures linking
the symbolic reasoner and the perception and action components of the system, which
themselves will rely on the integrated soft and hard real-time subsystems mentioned
above.

For ARMTRAK and other similar systems of the future, we would like to
provide a solid substrate of visuo-motor behaviors and primitive capabilities, based on
well-understood real-time technology. The user of these capabilities should not have
to think about the details of their operation. Likewise, primitives for cooperation,
preemption, and parallel operation of these low-level capabilities should be provided:
a smooth integration of hard and soft real-time systems is an important aspect of this
work.

In addition to our ARMTRAK work, our studies of learning algorithms have
revealed ways of learning correct primitive sequences by trial and error or training
[Whitehead and Ballard 1990; Rimey and Brown 1990]. This work suggests ways
that systems can learn to adapt behaviors in complex environments and lays the
groundwork for building systems that satisfice.


11  Technology  Transfer

Under the contract Rochester developed large amounts of Butterfly applications
software, the Connectionist Simulator, and the Zebra/Zed system for object-oriented
register-level programming. The simulator and Zebra/Zed are available by
anonymous ftp or magnetic media, and hundreds of copies have been sent out worldwide.

Rochester  has  a  substantial  Industrial  Affiliates  Program,  with  industrial  partners 
including  BBN,  GE,  Kodak,  and  Xerox.  In  the  recent  past,  we  have  had  active 
research  collaboration  in  the  areas  of  vision,  reasoning,  and  parallel  programming 
environments  with  each  of  these  affiliates.  We  have  an  annual  meeting  to  keep  affiliates 
abreast  of  our  work,  and  to  keep  them  aware  of  students  and  personnel  here  with 
whom  they  may  have  interests  in  common.  Rochester  students  normally  spend  one  or 
two  summer  terms  working  in  industry,  and  the  resulting  ties  with  IBM,  GE  Research, 
GM  Research,  AT&T,  and  Xerox  (both  PARC  and  Webster  Research  Centers)  are 
healthy  and  strong.  These  couplings  are  often  demonstrated  in  observable  product 
(the indefinite loan of the IBM ACE computer to Fowler, the joint publications of
Swain  and  J.L.  Mundy  of  GE  Research,  etc.). 

Rochester  participated  in  the  first  DARPA  parallel  vision  architecture  benchmark, 
and  the  resulting  applications  software  (as  well  as  the  other  programming  libraries  and 
facilities we have developed) are disseminated through BBN. Rochester’s large and
well-subscribed technical reports service distributes reports to hundreds of industrial
and academic sites monthly.

There  is  evidence  that  scientific  papers  have  transferred  some  of  the  technology 
successfully:  the  Instant  Replay  system  was  implemented  on  Sequent  computers  by 
a  group  in  Germany,  for  example.  Through  an  international  computer  newsgroup 
the  expertise  on  the  DataCube  pipelined  processor  is  both  shared  and  acquired.  The 
Rochester  Connectionist  Simulator  and  the  Zebra/Zed  systems  are  available  by  anony¬ 
mous  ftp.  Together  they  have  been  distributed  to  several  hundred  sites  worldwide. 


12  Thesis  Abstracts 

Several  theses  appeared  during  the  contract  period  that  were  directly  related  to  the 
contract.  Many  more  were  initiated  during  the  contract  period  and  have  been  completed
since,  or  are  still  (1990)  in  progress.  The  following  are  representative  of  earlier
work  under  the  contract. 

Aloimonos,  J.,  “Computing  intrinsic  images,”  Ph.D.  Thesis  and  TR  198,  August 
1986:  Several  theories  have  been  proposed  in  the  literature  for  the  computation  of 
shape  from  shading,  shape  from  texture,  retinal  motion  from  spatiotemporal  deriva¬ 
tives  of  the  image  intensity  function,  and  the  like.  However:  (1)  The  employed 
assumptions  are  not  present  in  a  large  subset  of  real  images.  (2)  Usually  the  natural
constraints  do  not  guarantee  unique  answers,  calling  for  strong  additional  assumptions
about  the  world.  (3)  Even  if  physical  constraints  guarantee  unique  answers,  often  the 
resulting  algorithms  are  not  robust.  This  thesis  shows  that  if  several  available  cues 
are  combined,  then  the  resulting  algorithms  compute  intrinsic  parameters  (shape, 
depth,  motion,  etc.)  uniquely  and  robustly.  The  computational  aspect  of  the  theory 
envisages  a  cooperative  highly  parallel  implementation,  bringing  in  information  from
five  different  sources  (shading,  texture,  motion,  contour  and  stereo),  to  resolve  ambi¬ 
guities  and  ensure  uniqueness  and  stability  of  the  intrinsic  parameters.  The  problems 
of  shape  from  texture,  shape  from  shading  and  motion,  visual  motion  analysis,  and 
shape  and  motion  from  contour  are  analyzed  in  detail. 

Bandopadhay,  A.,  “A  computational  study  of  rigid  motion  perception,”  Ph.D.  Thesis
and  TR  221,  December  1986:  The  interpretation  of  visual  motion  is  investigated.
The  task  of  motion  perception  is  divided  into  two  major  subtasks:  (1)  estimation  of 
two-dimensional  retinal  motion,  and  (2)  computation  of  parameters  of  rigid  motion 
from  retinal  motion.  Retinal  motion  estimation  is  performed  using  a  point  matching 
algorithm  based  on  local  similarity  of  matches  and  a  global  clustering  strategy.  The 
clustering  technique  unifies  the  notion  of  matching  and  motion  segmentation  and  pro¬ 
vides  an  insight  into  the  complexity  of  the  matching  and  segmentation  process.  The 
constraints  governing  the  computation  of  the  rigid  motion  parameters  from  retinal 
motion  are  investigated.  The  emphasis  is  on  determining  the  possible  ambiguities  of 
interpretation  and  how  to  remove  them.  This  theoretical  analysis  forms  the  basis  of  a
set  of  algorithms  for  computing  structure  and  three-dimensional  motion  parameters 
from  retinal  displacements.  The  algorithms  are  experimentally  evaluated.  The  main 
difficulties  facing  the  computation  are  nonlinearity  and  a  high-dimensional  search 
space  of  solutions.  To  alleviate  these  difficulties,  an  active  tracking  method  is  pro¬ 
posed.  This  is  a  closed  loop  system  for  evaluating  the  motion  parameters.  Under 
such  a  regime,  it  is  possible  to  obtain  closed-form  solutions  for  the  motion  parameters.  This
leads  to  a  robust  cooperative  algorithm  for  motion  perception  requiring  a  minimal 
amount  of  retinal  motion  matching.  The  central  theme  for  this  research  has  been 
the  evaluation  of  a  hierarchical  model  for  visual  motion  perception.  To  this  end, 
the  investigations  revolved  around  three  primary  issues:  (1)  retinal  motion  computa¬ 
tion  from  intensity  images;  (2)  the  conditions  under  which  three-dimensional  motion 
may  be  computed  from  retinal  motion,  and  the  efficacy  of  algorithms  that  perform 
such  computations;  (3)  the  active  vision  or  closed  loop  approach  to  visual  motion 
interpretation  and  what  it  buys  us. 

Chou,  P.  B.-L.,  “The  theory  and  practice  of  Bayesian  image  labeling,”  Ph.D.  Thesis
and  TR  258,  August  1988:  Integrating  disparate  sources  of  information  has  been
recognized  as  one  of  the  keys  to  the  success  of  general  purpose  vision  systems.  Image 
clues  such  as  shading,  texture,  stereo  disparities  and  image  flows  provide  uncertain, 
local  and  incomplete  information  about  the  three-dimensional  scene.  Spatial  a  priori 
knowledge  plays  the  role  of  filling  in  missing  information  and  smoothing  out  noise. 
This  thesis  proposes  a  solution  to  the  longstanding  open  problem  of  visual  integra¬ 
tion.  It  reports  a  framework,  based  on  Bayesian  probability  theory,  for  computing 
an  intermediate  representation  of  the  scene  from  disparate  sources  of  information. 
The  computation  is  formulated  as  a  labeling  problem.  Local  visual  observations  for 
each  image  entity  are  reported  as  label  likelihoods.  They  are  combined  consistently 
and  coherently  on  hierarchically  structured  label  trees  with  a  new,  computationally 
simple  procedure.  The  pooled  label  likelihoods  are  fused  with  the  a  priori  spatial 
knowledge  encoded  as  Markov  Random  Fields  (MRFs).  The  a  posteriori  distribution
of  the  labelings  is  thus  derived  in  a  Bayesian  formalism.  A  new  inference  method,
Highest  Confidence  First  (HCF)  estimation,  is  used  to  infer  a  unique  labeling  from 
the  a  posteriori  distribution.  Unlike  previous  inference  methods  based  on  the  MRF 
formalism,  HCF  is  computationally  efficient  and  predictable  while  meeting  the  prin¬ 
ciples  of  graceful  degradation  and  least  commitment.  The  results  of  the  inference 
process  are  consistent  with  both  observable  evidence  and  a  priori  knowledge.  The  ef¬ 
fectiveness  of  the  approach  is  demonstrated  with  experiments  on  two  image  analysis 
problems:  intensity  edge  detection  and  surface  reconstruction.  For  edge  detection, 
likelihood  outputs  from  a  set  of  local  edge  operators  are  integrated  with  a  priori 
knowledge  represented  as  an  MRF  probability  distribution.  For  surface  reconstruc¬ 
tion,  intensity  information  is  integrated  with  sparse  depth  measurements  and  a  priori 
knowledge.  Coupled  MRFs  provide  a  unified  treatment  of  surface  reconstruction  and 
segmentation,  and  an  extension  of  HCF  implements  a  solution  method.  Experiments
using  real  image  and  depth  data  yield  robust  results.  The  framework  can  also  be 
generalized  to  higher-level  vision  problems,  as  well  as  to  other  domains. 

Dibble,  P.C.,  “A  Parallel  Interleaved  File  System,”  Ph.D.  Thesis  and  TR  334,
March  1990:  A  computer  system  is  most  useful  when  it  has  well-balanced  processor 
and  I/O  performance.  Parallel  architectures  allow  fast  computers  to  be  constructed 
from  unsophisticated  hardware.  The  usefulness  of  these  machines  is  severely  limited 
unless  they  are  fitted  with  I/O  subsystems  that  match  their  CPU  performance.  Most 
parallel  computers  have  insufficient  I/O  performance,  or  use  exotic  hardware  to  force 
enough  I/O  bandwidth  through  a  uniprocessor  file  system.  This  approach  is  only 
useful  for  small  numbers  of  processors.  Even  a  modestly  parallel  computer  cannot  be 
served  by  an  ordinary  file  system.  Only  a  parallel  file  system  can  scale  with  the  pro¬ 
cessor  hardware  to  meet  the  I/O  demands  of  a  parallel  computer.  This  dissertation 
introduces  the  concept  of  a  parallel  interleaved  file  system.  This  class  of  file  system 
incorporates  three  concepts:  parallelism,  interleaving,  and  tools.  Parallelism  appears 
as  a  characteristic  of  the  file  system  program  and  in  the  disk  hardware.  The  parallel 
file  system  software  and  hardware  allow  the  file  system  to  scale  with  the  other  components
of  a  multiprocessor  computer.  Interleaving  is  the  rule  the  file  system  uses  to
distribute  data  among  the  processors.  Interleaved  record  distribution  is  the  simplest 
and  in  many  ways  the  best  algorithm  for  allocating  records  to  processors.  Tools  are 
application  code  that  can  enter  the  file  system  at  a  level  that  exposes  the  parallel 
structure  of  the  files.  In  many  cases  tools  decrease  interprocessor  communication  by 
moving  processing  to  the  data  instead  of  moving  the  data.  The  thesis  of  this  disser¬ 
tation  is  that  a  parallel  interleaved  file  system  will  provide  scalable  high-performance 
I/O  for  a  wide  range  of  parallel  architectures  while  supporting  a  comprehensive  set 
of  conventional  file  system  facilities.  We  have  confirmed  our  performance  claims  ex¬ 
perimentally  and  theoretically.  Our  experiments  show  practically  linear  speedup  to 
the  limits  of  our  hardware  for  file  copy,  file  sort,  and  matrix  transpose  on  an  array  of 
bits  stored  in  a  file.  Our  analysis  predicts  the  measured  results  and  supports  a  claim 
that  the  file  system  will  easily  scale  to  more  than  128  processors  with  disk  drives. 
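
The interleaving rule described in this abstract can be illustrated with a minimal sketch. It assumes the simplest round-robin rule (record i is stored on node i mod P); the abstract does not give the exact rule, and the function names below are hypothetical, not taken from the dissertation.

```python
def interleave(records, num_nodes):
    """Distribute records round-robin: record i goes to node i % num_nodes.

    Assumption: plain modulo interleaving; the thesis may use a variant.
    """
    nodes = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        nodes[i % num_nodes].append(rec)
    return nodes


def locate(record_index, num_nodes):
    """Map a global record index to (node, index within that node)."""
    return record_index % num_nodes, record_index // num_nodes


def read_record(nodes, record_index):
    """Fetch a record by its global index from the interleaved layout."""
    node, local = locate(record_index, len(nodes))
    return nodes[node][local]


def local_sums(nodes):
    """A 'tool' in the abstract's sense: each node reduces its own records,
    so only one partial result per node has to be communicated."""
    return [sum(node_records) for node_records in nodes]
```

With ten records on four nodes, `interleave(list(range(10)), 4)` places records 0, 4, 8 on node 0 and records 1, 5, 9 on node 1, so consecutive records can be read from different nodes in parallel.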

Floyd,  R.A.,  “Transparency  in  distributed  file  systems,”  Ph.D.  Thesis  and  TR
272,  January  1989:  The  last  few  years  have  seen  an  explosion  in  the  research  and 
development  of  distributed  file  systems.  Existing  systems  provide  a  limited  degree 
of  network  transparency,  with  researchers  generally  arguing  that  full  network  transparency
is  unachievable.  Attempts  to  understand  and  address  these  arguments  have
been  limited  by  a  lack  of  understanding  of  the  range  of  possible  solutions  to  trans¬ 
parency  issues  and  a  lack  of  knowledge  of  the  ways  in  which  file  systems  are  used.  We 
address  these  problems  by:  (1)  designing  and  implementing  a  prototype  of  a  highly 
transparent  distributed  file  system;  (2)  collecting  and  analyzing  data  on  file  and  di¬ 
rectory  reference  patterns;  and  (3)  using  these  data  to  analyze  the  effectiveness  of 
our  design.  Our  distributed  file  system,  Roe,  supports  a  substantially  higher  degree 
of  transparency  than  earlier  distributed  file  systems,  and  is  able  to  do  this  in  a  het¬ 
erogeneous  environment.  Roe  appears  to  users  to  be  a  single,  globally  accessible  file 
system  providing  highly  available,  consistent  files.  It  provides  a  coherent  framework
for  uniting  techniques  in  the  areas  of  naming,  replication,  consistency  control,  file 
and  directory  placement,  and  file  and  directory  migration  in  a  way  that  provides  full 
network  transparency.  This  transparency  allows  Roe  to  provide  increased  availability, 
automatic  reconfiguration,  effective  use  of  resources,  a  simplified  file  system  model, 
and  important  performance  benefits.  Our  data  collection  and  analysis  work  provides 
detailed  information  on  short-term  file  reference  patterns  in  the  UNIX  environment. 
In  addition  to  examining  the  overall  request  behavior,  we  break  references  down  by 
the  type  of  file,  owner  of  file,  and  type  of  user.  We  find  significant  differences  in  ref¬ 
erence  patterns  between  the  various  classes  that  can  be  used  as  a  basis  for  placement 
and  migration  algorithms.  Our  study  also  provides,  for  the  first  time,  information  on 
directory  reference  patterns  in  a  hierarchical  file  system.  The  results  provide  striking 
evidence  of  the  importance  of  name  resolution  overhead  in  UNIX  environments.  Using
our  data  collection  and  analysis  results,  we  examine  the  availability  and  performance
of  Roe.  File  open  overhead  proves  to  be  an  issue,  but  techniques  exist  for  reducing 
its  impact. 

Friedberg,  S.A.,  “Hierarchical  process  composition:  Dynamic  maintenance  of  struc¬ 
ture  in  a  distributed  environment,”  Ph.D.  Thesis  and  TR  294,  1988:  This  disserta¬ 
tion  is  a  study  in  depth  of  a  method,  called  hierarchical  process  composition  (HPC), 
for  organizing,  developing,  and  maintaining  large  distributed  programs.  HPC  extends 
the  process  abstraction  to  nested  collections  of  processes,  allowing  a  multiprocess  pro¬ 
gram  in  place  of  any  single  process,  and  provides  a  rich  set  of  structuring  mechanisms 
for  building  distributed  applications.  The  emphasis  in  HPC  is  on  structural  and 
architectural  issues  in  distributed  software  systems,  especially  interactions  involving 
dynamic  reconfiguration,  protection,  and  distribution.  The  major  contributions  of 
this  work  come  from  the  detailed  consideration,  based  on  case  studies,  formal  analy¬ 
sis,  and  a  prototype  implementation,  of  how  abstraction  and  composition  interact  in 
unexpected  ways  with  each  other  and  with  a  distributed  environment.  HPC  ties  pro¬ 
cesses  together  with  heterogeneous  interprocess  communication  mechanisms,  such  as 
TCP/IP  and  remote  procedure  call.  Explicit  structure  determines  the  logical  connec¬ 
tivity  between  processes,  masking  differences  in  communication  mechanisms.  HPC 
supports  one-to-one,  parallel  channel,  and  many-to-many  (multicasting)  connectivity. 
Efficient  computation  of  end-to-end  connectivity  from  the  communication  structure  is 
a  challenging  problem,  and  a  third-party  connection  facility  is  needed  to  implement 
dynamic  reconfiguration  when  the  logical  connectivity  changes.  Explicit  structure 
also  supports  grouping  and  nesting  of  processes.  HPC  uses  this  process  structure 
to  define  meaningful  protection  domains.  Access  control  is  structured  (and  the  ba¬ 
sic  HPC  facilities  may  be  extended)  using  the  same  powerful  tools  used  to  define 
communication  patterns.  HPC  provides  escapes  from  the  strict  hierarchy  for  direct 
communication  between  any  two  programs,  enabling  transparent  access  to  global  ser¬ 
vices.  These  escapes  are  carefully  controlled  to  prevent  interference  and  to  preserve 
the  appearance  of  a  strict  hierarchy.  This  work  is  also  a  rare  case  study  in  consistency
control  for  non-trivial,  highly-available  services  in  a  distributed  environment.
Since  HPC  abstraction  and  composition  operations  must  be  available  during  network 
partitions,  basic  structural  constraints  can  be  violated  when  separate  partitions  are 
merged.  By  exhaustive  case  analysis,  all  possible  merge  inconsistencies  that  could 
arise  in  HPC  have  been  identified  and  it  is  shown  how  each  inconsistency  can  be 
either  avoided,  automatically  reconciled  by  the  system,  or  reported  to  the  user  for 
application-specific  reconciliation. 

Loui,  R.P.,  “Theory  and  computation  of  uncertain  inference  and  decision,”  Ph.D.
Thesis  and  TR  228,  September  1987:  This  interdisciplinary  dissertation  studies  uncertain
inference  pursuant  to  the  purposes  of  artificial  intelligence,  while  following
the  tradition  of  philosophy  of  science.  Its  major  achievement  is  the  extension  and 
integration  of  work  in  epistemology  and  knowledge  representation.  This  results  in 
both  a  better  system  for  evidential  reasoning  and  a  better  system  for  qualitative 
non-monotonic  reasoning.  By  chapter,  the  contributions  are:  a  comparison  of  non¬ 
monotonic  and  inductive  logic;  the  effective  implementation  of  Kyburg’s  indetermi¬ 
nate  probability  system;  an  extension  of  that  system;  a  proposal  for  decision-making 
with  indeterminate  probabilities;  a  system  of  non-monotonic  reasoning  motivated  by 
the  study  of  probabilistic  reasoning;  some  consequences  of  this  system;  a  conventionalistic
foundation  for  decision  theory  and  non-monotonic  reasoning.

Mellor-Crummey,  J.,  “Debugging  and  analysis  of  large-scale  parallel  programs,” 
Ph.D.  Thesis  and  TR  312,  September  1989:  One  of  the  most  serious  problems  in  the 
development  cycle  of  large-scale  parallel  programs  is  the  lack  of  tools  for  debugging 
and  performance  analysis.  Parallel  programs  are  more  difficult  to  analyze  than  their 
sequential  counterparts  for  several  reasons.  First,  race  conditions  in  parallel  programs 
can  cause  non-deterministic  behavior,  which  reduces  the  effectiveness  of  traditional 
cyclic  debugging  techniques.  Second,  invasive,  interactive  analysis  can  distort  a  par¬ 
allel  program’s  execution  beyond  recognition.  Finally,  comprehensive  analysis  of  a 
parallel  program’s  execution  requires  collection,  management,  and  presentation  of 
an  enormous  amount  of  information.  This  dissertation  addresses  the  problem  of 
debugging  and  analysis  of  large-scale  parallel  programs  executing  on  shared-memory 
multiprocessors.  It  proposes  a  methodology  for  top-down  analysis  of  parallel  program 
executions  that  replaces  previous  ad-hoc  approaches.  To  support  this  methodology, 
a  formal  model  for  shared-memory  communication  among  processes  in  a  parallel  pro¬ 
gram  is  developed.  It  is  shown  how  synchronization  traces  based  on  this  abstract 
model  can  be  used  to  create  indistinguishable  executions  that  form  the  basis  for  de¬ 
bugging.  This  result  is  used  to  develop  a  practical  technique  for  tracing  parallel 
program  executions  on  shared-memory  parallel  processors  so  that  their  executions 
can  be  repeated  deterministically  on  demand.  Next,  it  is  shown  how  these  traces  can 
be  augmented  with  additional  information  that  increases  their  utility  for  debugging 
and  performance  analysis.  The  design  of  an  integrated,  extensible  toolkit  based  on 
these  traces  is  proposed.  This  toolkit  uses  execution  traces  to  support  interactive, 
graphics-based,  top-down  analysis  of  parallel  program  executions.  A  prototype  implementation
of  the  toolkit  is  described,  explaining  how  it  exploits  our  execution  tracing
model  to  facilitate  debugging  and  analysis.  Case  studies  of  the  behavior  of  several 
versions  of  two  parallel  programs  are  presented  to  demonstrate  both  the  utility  of  our 
execution  tracing  model  and  the  leverage  it  provides  for  debugging  and  performance 
analysis. 
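
The idea of tracing synchronization so that an execution can be repeated deterministically can be sketched as follows. This is a simplified illustration, not the mechanism of the dissertation: it assumes a single shared object and logs the order of accessing thread ids, and the class and function names are hypothetical.

```python
import threading


class TracedObject:
    """Shared object that records the order of accesses and, when given a
    previously recorded order, forces threads to access it in that order."""

    def __init__(self, replay_order=None):
        self._cond = threading.Condition()
        self._replay = list(replay_order) if replay_order is not None else None
        self._next = 0      # position in the replay order
        self.trace = []     # thread ids in the order they gained access

    def access(self, thread_id, action):
        with self._cond:
            if self._replay is not None:
                # Replay mode: block until the trace says it is our turn.
                while self._replay[self._next] != thread_id:
                    self._cond.wait()
            action()
            self.trace.append(thread_id)
            self._next += 1
            self._cond.notify_all()


def run(replay_order=None, num_threads=3, steps=3):
    """Run a small racy program; return (access trace, visible result)."""
    shared = TracedObject(replay_order)
    log = []

    def worker(tid):
        for step in range(steps):
            shared.access(tid, lambda: log.append((tid, step)))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.trace, log
```

A first call `trace, log = run()` records whatever interleaving the scheduler happened to produce; a second call `run(trace)` reproduces the same interleaving and therefore the same `log`, which is what makes cyclic debugging possible again.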

Olson,  T.J.,  “An  architectural  model  of  visual  motion  understanding,”  Ph.D.  Thesis
and  TR  305,  August  1989:  The  past  few  years  have  seen  an  explosion  of  interest
in  the  recovery  and  use  of  visual  motion  information  by  biological  and  machine  vision 
systems.  In  the  area  of  computer  vision,  a  variety  of  algorithms  have  been  developed 
for  extracting  various  types  of  motion  information  from  images.  Neuroscientists  have 
made  great  strides  in  understanding  the  flow  of  motion  information  from  the  retina 
to  striate  and  extrastriate  cortex.  The  psychophysics  community  has  gone  a  long 
way  toward  characterizing  the  limits  and  structure  of  human  motion  processing.  The 
central  claim  of  this  thesis  is  that  many  puzzling  aspects  of  motion  perception  can  be 
understood  by  assuming  a  particular  architecture  for  the  human  motion  processing 
system.  The  architecture  consists  of  three  functional  units  or  subsystems.  The  first  or 
low-level  subsystem  computes  simple  mathematical  properties  of  the  visual  signal.  It
is  entirely  bottom-up,  and  prone  to  error  when  its  implicit  assumptions  are  violated. 
The  intermediate-level  subsystem  combines  the  low-level  system’s  output  with  world 
knowledge,  segmentation  information  and  other  inputs  to  construct  a  representation 
of  the  world  in  terms  of  primitive  forms  and  their  trajectories.  It  is  claimed  to  be 
the  substrate  for  long-range  apparent  motion.  The  highest  level  of  the  motion  system 
assembles  intermediate-level  form  and  motion  primitives  into  scenarios  that  can  be 
used  for  prediction  and  for  matching  against  stored  models.  This  architecture  is  the 
result  of  joint  work  with  Jerome  Feldman  and  Nigel  Goddard.  The  description  of  the 
low-level  system  is  in  accord  with  the  standard  view  of  early  motion  processing,  and 
the  details  of  the  high-level  system  are  being  worked  out  by  Goddard.  The  secondary 
contribution  of  this  thesis  is  a  detailed  connectionist  model  of  the  intermediate  level 
of  the  architecture.  In  order  to  compute  the  trajectories  of  primitive  shapes  it  is 
necessary  to  design  mechanisms  for  handling  time  and  Gestalt  grouping  effects  in 
connectionist  networks.  Solutions  to  these  problems  are  developed  and  used  to  con¬ 
struct  a  network  that  interprets  continuous  and  apparent  motion  stimuli  in  a  limited 
domain.  Simulation  results  show  that  its  interpretations  are  in  qualitative  agreement 
with  human  perception. 

Shastri,  L.,  “Evidential  reasoning  in  semantic  networks:  A  formal  theory  and  its
parallel  implementation,”  Ph.D.  Thesis  and  TR  166,  September  1985:  This  thesis
describes  an  evidential  framework  for  representing  conceptual  knowledge,  wherein
the  principle  of  maximum  entropy  is  applied  to  deal  with  uncertainty  and  incomplete¬ 
ness.  It  is  demonstrated  that  the  proposed  framework  offers  a  uniform  treatment  of 
inheritance  and  categorization,  and  solves  an  interesting  class  of  inheritance  and  cate¬ 
gorization  problems,  including  those  that  involve  exceptions,  multiple  hierarchies,  and 
conflicting  information.  The  proposed  framework  can  be  encoded  as  an  interpreter-free,
massively  parallel  (connectionist)  network  that  can  solve  the  inheritance  and
categorization  problems  in  time  proportional  to  the  depth  of  the  conceptual  hierar¬ 
chy. 

Sher,  D.B.,  “A  probabilistic  approach  to  low-level  vision,”  Ph.D.  Thesis  and  TR 
232,  October  1987:  A  probabilistic  approach  to  low-level  vision  algorithms  results 
in  algorithms  that  are  easy  to  tune  for  a  particular  application  and  modules  that 
can  be  used  for  many  applications.  Several  routines  that  return  likelihoods  can  be 
combined  into  a  single  more  robust  routine.  Thus  it  is  easy  to  construct  specialized 
yet  robust  low-level  vision  systems  out  of  algorithms  that  calculate  likelihoods.  This 
dissertation  studies  algorithms  that  generate  and  use  likelihoods.  Probabilities  derive 
from  likelihoods  using  Bayes’s  rule.  Thus  vision  algorithms  that  return  likelihoods 
also  generate  probabilities.  Likelihoods  are  used  by  Markov  Random  Field  algorithms.
This  approach  yields  facet  model  boundary  pixel  detectors  that  return  likelihoods. 
Experiments  show  that  the  detectors  designed  for  the  step  edge  model  are  on  par  with 
the  best  edge  detectors  reported  in  the  literature.  Algorithms  are  presented  here  that 
use  the  generalized  Hough  transform  to  calculate  likelihoods  for  object  recognition. 
Evidence,  represented  as  likelihoods,  from  several  detectors  that  view  the  same  data 
with  different  models  is  combined  here.  The  likelihoods  that  result  are  used  to  build
robust  detectors. 
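
The step from likelihoods to probabilities, and the combination of several likelihood-returning routines, can be illustrated with a short sketch. It assumes a two-hypothesis problem (edge or no edge at a pixel) and detectors that are conditionally independent given the hypothesis; neither assumption is stated in the abstract, and the function name is hypothetical.

```python
def edge_posterior(lik_edge, lik_no_edge, prior_edge):
    """Posterior probability of 'edge' from per-detector likelihoods.

    lik_edge[k]    = P(observation of detector k | edge)
    lik_no_edge[k] = P(observation of detector k | no edge)
    Assumption: detector outputs are conditionally independent, so the
    joint likelihood is the product of the individual likelihoods.
    """
    if len(lik_edge) != len(lik_no_edge):
        raise ValueError("one likelihood pair is needed per detector")
    joint_edge = prior_edge
    joint_no_edge = 1.0 - prior_edge
    for l_e, l_n in zip(lik_edge, lik_no_edge):
        joint_edge *= l_e
        joint_no_edge *= l_n
    total = joint_edge + joint_no_edge
    if total == 0.0:
        raise ValueError("observations are impossible under both hypotheses")
    # Bayes's rule: normalize the joint by the evidence.
    return joint_edge / total
```

With a prior of 0.5, one detector reporting likelihoods 0.8 (edge) and 0.2 (no edge) gives a posterior of 0.8; two such detectors that agree give roughly 0.94, which shows how pooling likelihoods from several routines yields a more confident combined detector.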